Ok, thanks Joel, I understand that now. I'll just do my own bootstrapping then.
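For anyone reading this in the archive, the kind of bootstrap resampling I mean is roughly the following (a pure-Python sketch; the helper name is made up, not a scikit-learn API, and in practice the indices would be drawn over the fetched 20 newsgroups documents):

```python
import random

def bootstrap_sample(data, target, seed):
    """Draw a bootstrap sample: same size as data, sampled with replacement."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(data)) for _ in range(len(data))]
    return [data[i] for i in idx], [target[i] for i in idx]

# toy stand-ins for twenty_train.data / twenty_train.target
docs = ["doc a", "doc b", "doc c", "doc d"]
labels = [0, 1, 0, 1]
boot_docs, boot_labels = bootstrap_sample(docs, labels, seed=7)
```

Unlike passing a different random_state to fetch_20newsgroups, each seed here really does change which documents the classifier is fit on.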
Andrew

On Thu, Aug 27, 2015 at 4:10 PM, <scikit-learn-general-requ...@lists.sourceforge.net> wrote:

> Send Scikit-learn-general mailing list submissions to
>         scikit-learn-general@lists.sourceforge.net
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> or, via email, send a message with subject or body 'help' to
>         scikit-learn-general-requ...@lists.sourceforge.net
>
> You can reach the person managing the list at
>         scikit-learn-general-ow...@lists.sourceforge.net
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Scikit-learn-general digest..."
>
> Today's Topics:
>
>    1. Re: issue with pipeline always giving same results (Andrew Howe)
>    2. Re: issue with pipeline always giving same results (Joel Nothman)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 27 Aug 2015 16:00:28 +0300
> From: Andrew Howe <ahow...@gmail.com>
> Subject: Re: [Scikit-learn-general] issue with pipeline always giving
>         same results
> To: scikit-learn-general@lists.sourceforge.net
>
> Sorry for the red herring, but I've realized it's not an issue with
> Pipeline.
> The code below has the same behavior:
>
> nw = dat.datetime.now()
> rndstat = nw.hour*3600 + nw.minute*60 + nw.second
> twenty_train = fetch_20newsgroups(subset='train', categories=categories,
>                                   random_state=rndstat, shuffle=True,
>                                   download_if_missing=False)
> twenty_test = fetch_20newsgroups(subset='test', categories=categories,
>                                  random_state=rndstat, shuffle=True,
>                                  download_if_missing=False)
>
> cv = CountVectorizer()
> X_train = cv.fit_transform(twenty_train.data)
> clf = MultinomialNB().fit(X_train, twenty_train.target)
> pred = clf.predict(cv.transform(twenty_test.data))
> print(sum(pred == twenty_test.target) / len(twenty_test.target))
>
> Andrew
>
> On Thu, Aug 27, 2015 at 3:45 PM, <scikit-learn-general-requ...@lists.sourceforge.net> wrote:
>
> > Today's Topics:
> >
> >    1. Tests against reference implementations, speed regression
> >       tests (Andreas Mueller)
> >    2. Turning on sample weights for linear_model.LogisticRegression
> >       (Valentin Stolbunov)
> >    3. Re: Turning on sample weights for
> >       linear_model.LogisticRegression (Joel Nothman)
> >    4. Re: Turning on sample weights for
> >       linear_model.LogisticRegression (Andy)
> >    5. Re: K-SVD implementation (Alexey Umnov)
> >    6. issue with pipeline always giving same results (Andrew Howe)
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Tue, 25 Aug 2015 13:06:11 -0400
> > From: Andreas Mueller <t3k...@gmail.com>
> > Subject: [Scikit-learn-general] Tests against reference
> >         implementations, speed regression tests
> > To: scikit-learn-general@lists.sourceforge.net
> >
> > Hey all.
> >
> > I will soon have some student dev resources and I'm pondering how best
> > to use them.
> > Apart from the hundreds of open issues, two things I was thinking about
> > adding are more tests against reference implementations, and speed
> > regression tests.
> >
> > For the reference implementations, we could hard-code the results of
> > algorithms into the tests. That is done for some algorithms, but only
> > very few. It would guard us against "obvious" functionality bugs, which
> > still show up from time to time.
> >
> > As for speed regression tests: it has happened that things got slower,
> > in particular with innocent-looking changes like input validation.
> > I think it would be good to have some tests that ensure we don't get
> > too much slower. I'm not entirely sure how to do that, though.
> > I know Vlad put some effort into a continuous benchmarking suite, but
> > since then there have been several efforts to log the speed of
> > implementations in a consistent way, and we might want to look into
> > those.
> >
> > Do you think these are interesting issues to work on, or are there
> > more pressing ones?
> >
> > We still have a lot to do on the API side, though I'm a bit hesitant
> > to give that to new devs.
> >
> > Cheers,
> > Andy
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Wed, 26 Aug 2015 19:15:53 -0500
> > From: Valentin Stolbunov <valentin.stolbu...@gmail.com>
> > Subject: [Scikit-learn-general] Turning on sample weights for
> >         linear_model.LogisticRegression
> > To: scikit-learn-general@lists.sourceforge.net
> >
> > Hello everyone,
> >
> > I noticed that two of the three solvers in the logistic regression
> > module (newton-cg and lbfgs) accept sample weights, but this feature
> > is hidden from users because sample_weight is not recognized as a
> > parameter of .fit(). Instead, sample_weight is set to ones (line 555
> > of logistic.py). To the best of my knowledge this is because the
> > default solver (liblinear) does not support them?
> >
> > Could we instead allow sample_weight as a parameter (default None) and
> > set it to ones only if the chosen solver is liblinear (with
> > appropriate documentation notes, similar to the way the L1 penalty is
> > supported only by liblinear)?
> >
> > I realize that SGDClassifier's .fit() accepts sample weights and the
> > loss can be set to 'log', but this isn't exactly the same.
> >
> > What do you think?
> >
> > Valentin
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Thu, 27 Aug 2015 11:29:40 +1000
> > From: Joel Nothman <joel.noth...@gmail.com>
> > Subject: Re: [Scikit-learn-general] Turning on sample weights for
> >         linear_model.LogisticRegression
> >
> > I agree. I suspect this was an unintentional omission, in fact.
> >
> > Apart from which, sample_weight support in liblinear could be merged
> > from https://github.com/scikit-learn/scikit-learn/pull/2784, which is
> > dormant and merely needs some core contributors to show interest in
> > merging it...
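What sample_weight means here, conceptually: each sample's term in the loss is scaled by its weight, so an integer weight of k behaves like repeating that sample k times. A generic plain-Python sketch of a weighted binary log-loss (not scikit-learn's internal implementation):

```python
import math

def weighted_log_loss(y_true, p_pred, weights):
    """Binary log-loss with each sample's term scaled by its weight."""
    total = sum(-w * (y * math.log(p) + (1 - y) * math.log(1 - p))
                for y, p, w in zip(y_true, p_pred, weights))
    return total / sum(weights)

# a weight of 2 matches duplicating that sample
y, p = [1, 0, 1], [0.9, 0.2, 0.6]
loss_weighted   = weighted_log_loss(y, p, [1, 1, 2])
loss_duplicated = weighted_log_loss(y + [1], p + [0.6], [1, 1, 1, 1])
```

The two losses agree (up to floating-point rounding), which is why solvers that accept sample_weight can support it with a one-line change to the objective.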
> >
> > ------------------------------
> >
> > Message: 4
> > Date: Wed, 26 Aug 2015 22:59:44 -0400
> > From: Andy <t3k...@gmail.com>
> > Subject: Re: [Scikit-learn-general] Turning on sample weights for
> >         linear_model.LogisticRegression
> >
> > On 08/26/2015 09:29 PM, Joel Nothman wrote:
> > > I agree. I suspect this was an unintentional omission, in fact.
> > >
> > > Apart from which, sample_weight support in liblinear could be merged
> > > from https://github.com/scikit-learn/scikit-learn/pull/2784 which is
> > > dormant, and merely needs some core contributors to show interest in
> > > merging it...
> >
> > "merely" ;)
> >
> > ------------------------------
> >
> > Message: 5
> > Date: Thu, 27 Aug 2015 15:28:08 +0300
> > From: Alexey Umnov <alexe...@yandex.ru>
> > Subject: Re: [Scikit-learn-general] K-SVD implementation
> >
> > An HTML attachment was scrubbed...
> >
> > ------------------------------
> >
> > Message: 6
> > Date: Thu, 27 Aug 2015 15:44:38 +0300
> > From: Andrew Howe <ahow...@gmail.com>
> > Subject: [Scikit-learn-general] issue with pipeline always giving
> >         same results
> >
> > I'm working through the tutorial, and also experimenting on my own.
> > I'm on the text analysis example, and am curious about the relative
> > merits of analyzing by word frequency, relative frequency, and
> > adjusted relative frequency. Using the 20 newsgroups data, I've built
> > a set of pipelines within a cross-validation loop; the important part
> > of the code is here:
> >
> > # get the data
> > nw = dat.datetime.now()
> > rndstat = nw.hour*3600 + nw.minute*60 + nw.second
> > twenty_train = fetch_20newsgroups(subset='train', categories=categories,
> >                                   random_state=rndstat, shuffle=True,
> >                                   download_if_missing=False)
> > twenty_test = fetch_20newsgroups(subset='test', categories=categories,
> >                                  random_state=rndstat, shuffle=True,
> >                                  download_if_missing=False)
> >
> > # first with raw counts
> > text_clf = Pipeline([('vect', CountVectorizer()),
> >                      ('clf', MultinomialNB())])
> > text_clf.fit(twenty_train.data, twenty_train.target)
> > pred = text_clf.predict(twenty_test.data)
> > test_ccrs[mccnt, 0] = sum(pred == twenty_test.target) / len(twenty_test.target)
> >
> > The issue is that every time I run this, though I've confirmed the
> > data sampled is different, the value in test_ccrs is *always* the
> > same. Am I missing something?
> >
> > Thanks!
> > Andrew
> >
> > <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> > J. Andrew Howe, PhD
> > Editor-in-Chief, European Journal of Mathematical Sciences
> > Executive Editor, European Journal of Pure and Applied Mathematics
> > www.andrewhowe.com
> > http://www.linkedin.com/in/ahowe42
> > https://www.researchgate.net/profile/John_Howe12/
> > I live to learn, so I can learn to live. - me
> > <~~~~~~~~~~~~~~~~~~~~~~~~~~~>
> >
> > ------------------------------
> >
> > End of Scikit-learn-general Digest, Vol 67, Issue 44
> > ****************************************************
>
> ------------------------------
>
> Message: 2
> Date: Thu, 27 Aug 2015 23:10:33 +1000
> From: Joel Nothman <joel.noth...@gmail.com>
> Subject: Re: [Scikit-learn-general] issue with pipeline always giving
>         same results
>
> The randomisation only changes the order of the data, not the set of
> data points.
>
> ------------------------------
>
> End of Scikit-learn-general Digest, Vol 67, Issue 45
> ****************************************************
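Joel's point, that shuffling with a different random_state changes only the ordering and never the membership of the fetched set, can be checked with a toy example (plain Python, standing in for the fetch_20newsgroups behavior):

```python
import random

data = list(range(10))       # stand-in for the fetched documents
a, b = data[:], data[:]
random.Random(1).shuffle(a)  # "random_state" 1
random.Random(2).shuffle(b)  # "random_state" 2

# orderings may differ, but both hold exactly the same data points,
# so a classifier fit on them sees the same training set
same_points = sorted(a) == sorted(b)
```

Order-insensitive steps (vectorize, fit, score) therefore give identical results for every seed, which explains the constant accuracy.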
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general