Ok, thanks Joel, I understand that now. I'll just do my own bootstrapping then.
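For anyone reading this in the archive, the kind of bootstrap resampling I mean is roughly the following (a pure-Python sketch; the helper name is made up, not a scikit-learn API, and in practice the indices would be drawn over the fetched 20 newsgroups documents):

```python
import random

def bootstrap_sample(data, target, seed):
    """Draw a bootstrap sample: same size as data, sampled with replacement."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(data)) for _ in range(len(data))]
    return [data[i] for i in idx], [target[i] for i in idx]

# toy stand-ins for twenty_train.data / twenty_train.target
docs = ["doc a", "doc b", "doc c", "doc d"]
labels = [0, 1, 0, 1]
boot_docs, boot_labels = bootstrap_sample(docs, labels, seed=7)
```

Unlike passing a different random_state to fetch_20newsgroups, each seed here really does change which documents the classifier is fit on.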
Andrew

On Thu, Aug 27, 2015 at 4:10 PM, <scikit-learn-general-requ...@lists.sourceforge.net> wrote:

> Send Scikit-learn-general mailing list submissions to
>         scikit-learn-general@lists.sourceforge.net
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> or, via email, send a message with subject or body 'help' to
>         scikit-learn-general-requ...@lists.sourceforge.net
>
> You can reach the person managing the list at
>         scikit-learn-general-ow...@lists.sourceforge.net
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Scikit-learn-general digest..."
>
> Today's Topics:
>
>    1. Re: issue with pipeline always giving same results (Andrew Howe)
>    2. Re: issue with pipeline always giving same results (Joel Nothman)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 27 Aug 2015 16:00:28 +0300
> From: Andrew Howe <ahow...@gmail.com>
> Subject: Re: [Scikit-learn-general] issue with pipeline always giving
>         same results
> To: scikit-learn-general@lists.sourceforge.net
>
> Sorry for the red herring, but I've realized it's not an issue with
> Pipeline.
> The code below has the same behavior:
>
> nw = dat.datetime.now()
> rndstat = nw.hour*3600 + nw.minute*60 + nw.second
> twenty_train = fetch_20newsgroups(subset='train', categories=categories,
>                                   random_state=rndstat, shuffle=True,
>                                   download_if_missing=False)
> twenty_test = fetch_20newsgroups(subset='test', categories=categories,
>                                  random_state=rndstat, shuffle=True,
>                                  download_if_missing=False)
>
> cv = CountVectorizer()
> X_train = cv.fit_transform(twenty_train.data)
> clf = MultinomialNB().fit(X_train, twenty_train.target)
> pred = clf.predict(cv.transform(twenty_test.data))
> print(sum(pred == twenty_test.target) / len(twenty_test.target))
>
> Andrew
>
> On Thu, Aug 27, 2015 at 3:45 PM, <scikit-learn-general-requ...@lists.sourceforge.net> wrote:
>
> > Today's Topics:
> >
> >    1. Tests against reference implementations, speed regression
> >       tests (Andreas Mueller)
> >    2. Turning on sample weights for linear_model.LogisticRegression
> >       (Valentin Stolbunov)
> >    3. Re: Turning on sample weights for
> >       linear_model.LogisticRegression (Joel Nothman)
> >    4. Re: Turning on sample weights for
> >       linear_model.LogisticRegression (Andy)
> >    5. Re: K-SVD implementation (Alexey Umnov)
> >    6. issue with pipeline always giving same results (Andrew Howe)
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Tue, 25 Aug 2015 13:06:11 -0400
> > From: Andreas Mueller <t3k...@gmail.com>
> > Subject: [Scikit-learn-general] Tests against reference
> >         implementations, speed regression tests
> > To: scikit-learn-general@lists.sourceforge.net
> >
> > Hey all.
> >
> > I will soon have some student dev resources and I'm pondering how best
> > to use them.
> > Apart from the hundreds of open issues, two things I was thinking about
> > adding are more tests against reference implementations, and speed
> > regression tests.
> >
> > For the reference implementations, we could hard-code the results of
> > algorithms into the tests. That is done for some algorithms, but only
> > very few. It would guard us against "obvious" functionality bugs, which
> > still show up from time to time.
> >
> > As for speed regression tests: it has happened that things got slower,
> > in particular with innocent-looking changes like input validation.
> > I think it would be good to have some tests that ensure we don't get
> > too much slower. I'm not entirely sure how to do that, though.
> > I know Vlad put some effort into a continuous benchmarking suite, but
> > since then there have been several efforts to log the speed of
> > implementations in a consistent way, and we might want to look into
> > those.
> >
> > Do you think these are interesting issues to work on, or are there
> > more pressing ones?
> >
> > We still have a lot to do on the API side, though I'm a bit hesitant
> > to give that to new devs.
> >
> > Cheers,
> > Andy
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Wed, 26 Aug 2015 19:15:53 -0500
> > From: Valentin Stolbunov <valentin.stolbu...@gmail.com>
> > Subject: [Scikit-learn-general] Turning on sample weights for
> >         linear_model.LogisticRegression
> > To: scikit-learn-general@lists.sourceforge.net
> >
> > Hello everyone,
> >
> > I noticed that two of the three solvers in the logistic regression
> > module (newton-cg and lbfgs) accept sample weights, but this feature
> > is hidden from users because sample_weight is not recognized as a
> > parameter of .fit(). Instead, sample_weight is set to ones (line 555
> > of logistic.py). To the best of my knowledge this is because the
> > default solver (liblinear) does not support them?
> >
> > Could we instead allow sample_weight as a parameter (default None) and
> > set it to ones only if the chosen solver is liblinear (with
> > appropriate documentation notes, similar to the way the L1 penalty is
> > supported only by liblinear)?
> >
> > I realize that SGDClassifier's .fit() accepts sample weights and the
> > loss can be set to 'log', but this isn't exactly the same.
> >
> > What do you think?
> >
> > Valentin
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Thu, 27 Aug 2015 11:29:40 +1000
> > From: Joel Nothman <joel.noth...@gmail.com>
> > Subject: Re: [Scikit-learn-general] Turning on sample weights for
> >         linear_model.LogisticRegression
> >
> > I agree. I suspect this was an unintentional omission, in fact.
> >
> > Apart from which, sample_weight support in liblinear could be merged
> > from https://github.com/scikit-learn/scikit-learn/pull/2784, which is
> > dormant and merely needs some core contributors to show interest in
> > merging it...
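What sample_weight means here, conceptually: each sample's term in the loss is scaled by its weight, so an integer weight of k behaves like repeating that sample k times. A generic plain-Python sketch of a weighted binary log-loss (not scikit-learn's internal implementation):

```python
import math

def weighted_log_loss(y_true, p_pred, weights):
    """Binary log-loss with each sample's term scaled by its weight."""
    total = sum(-w * (y * math.log(p) + (1 - y) * math.log(1 - p))
                for y, p, w in zip(y_true, p_pred, weights))
    return total / sum(weights)

# a weight of 2 matches duplicating that sample
y, p = [1, 0, 1], [0.9, 0.2, 0.6]
loss_weighted   = weighted_log_loss(y, p, [1, 1, 2])
loss_duplicated = weighted_log_loss(y + [1], p + [0.6], [1, 1, 1, 1])
```

The two losses agree (up to floating-point rounding), which is why solvers that accept sample_weight can support it with a one-line change to the objective.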
> >
> > ------------------------------
> >
> > Message: 4
> > Date: Wed, 26 Aug 2015 22:59:44 -0400
> > From: Andy <t3k...@gmail.com>
> > Subject: Re: [Scikit-learn-general] Turning on sample weights for
> >         linear_model.LogisticRegression
> >
> > On 08/26/2015 09:29 PM, Joel Nothman wrote:
> > > I agree. I suspect this was an unintentional omission, in fact.
> > >
> > > Apart from which, sample_weight support in liblinear could be merged
> > > from https://github.com/scikit-learn/scikit-learn/pull/2784 which is
> > > dormant, and merely needs some core contributors to show interest in
> > > merging it...
> >
> > "merely" ;)
> >
> > ------------------------------
> >
> > Message: 5
> > Date: Thu, 27 Aug 2015 15:28:08 +0300
> > From: Alexey Umnov <alexe...@yandex.ru>
> > Subject: Re: [Scikit-learn-general] K-SVD implementation
> >
> > An HTML attachment was scrubbed...
> >
> > ------------------------------
> >
> > Message: 6
> > Date: Thu, 27 Aug 2015 15:44:38 +0300
> > From: Andrew Howe <ahow...@gmail.com>
> > Subject: [Scikit-learn-general] issue with pipeline always giving
> >         same results
> >
> > I'm working through the tutorial, and also experimenting on my own.
> > I'm on the text analysis example, and am curious about the relative
> > merits of analyzing by word frequency, relative frequency, and
> > adjusted relative frequency. Using the 20 newsgroups data, I've built
> > a set of pipelines within a cross-validation loop; the important part
> > of the code is here:
> >
> > # get the data
> > nw = dat.datetime.now()
> > rndstat = nw.hour*3600 + nw.minute*60 + nw.second
> > twenty_train = fetch_20newsgroups(subset='train', categories=categories,
> >                                   random_state=rndstat, shuffle=True,
> >                                   download_if_missing=False)
> > twenty_test = fetch_20newsgroups(subset='test', categories=categories,
> >                                  random_state=rndstat, shuffle=True,
> >                                  download_if_missing=False)
> >
> > # first with raw counts
> > text_clf = Pipeline([('vect', CountVectorizer()),
> >                      ('clf', MultinomialNB())])
> > text_clf.fit(twenty_train.data, twenty_train.target)
> > pred = text_clf.predict(twenty_test.data)
> > test_ccrs[mccnt, 0] = sum(pred == twenty_test.target) / len(twenty_test.target)
> >
> > The issue is that every time I run this, though I've confirmed the
> > data sampled is different, the value in test_ccrs is *always* the
> > same. Am I missing something?
> >
> > Thanks!
> > Andrew
> >
> > <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> > J. Andrew Howe, PhD
> > Editor-in-Chief, European Journal of Mathematical Sciences
> > Executive Editor, European Journal of Pure and Applied Mathematics
> > www.andrewhowe.com
> > http://www.linkedin.com/in/ahowe42
> > https://www.researchgate.net/profile/John_Howe12/
> > I live to learn, so I can learn to live. - me
> > <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> >
> > ------------------------------
> >
> > End of Scikit-learn-general Digest, Vol 67, Issue 44
> > ****************************************************
>
> ------------------------------
>
> Message: 2
> Date: Thu, 27 Aug 2015 23:10:33 +1000
> From: Joel Nothman <joel.noth...@gmail.com>
> Subject: Re: [Scikit-learn-general] issue with pipeline always giving
>         same results
>
> The randomisation only changes the order of the data, not the set of
> data points.
>
> ------------------------------
>
> End of Scikit-learn-general Digest, Vol 67, Issue 45
> ****************************************************
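Joel's point, that shuffling with a different random_state changes only the ordering and never the membership of the fetched set, can be checked with a toy example (plain Python, standing in for the fetch_20newsgroups behavior):

```python
import random

data = list(range(10))       # stand-in for the fetched documents
a, b = data[:], data[:]
random.Random(1).shuffle(a)  # "random_state" 1
random.Random(2).shuffle(b)  # "random_state" 2

# orderings may differ, but both hold exactly the same data points,
# so a classifier fit on them sees the same training set
same_points = sorted(a) == sorted(b)
```

Order-insensitive steps (vectorize, fit, score) therefore give identical results for every seed, which explains the constant accuracy.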
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general