Hi guys,

Thanks for your responses.

To start, I will implement a feature selection algorithm called LSE
(Least Squares Estimation), which is really useful for speeding up the
pipeline in near-real-time applications.

The main idea behind this algorithm is to find a subset of features based
on their capacity to reproduce the projections onto another feature space
(such as PCA or PLS-R). It is useful because, experimentally, you get
almost the same feature space as PCA or PLS-R (and, as a consequence, the
same performance), but via feature selection, which is much faster when
the feature extraction process is slow.
The reference for this algorithm is:

Mao, K. (2005). Identifying critical variables of principal components for
unsupervised feature selection. IEEE Transactions on Systems, Man, and
Cybernetics, Part B: Cybernetics, 35, 339-344.
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1408062&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F3477%2F30535%2F01408062
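To give a feel for the idea, here is a rough sketch in Python of the kind
of selection I mean. This is a simplified greedy variant I wrote for
illustration, not the exact procedure from Mao's paper; the function name
and the greedy scoring are my own:

import numpy as np
from sklearn.decomposition import PCA

def select_features_lse(X, n_selected, n_components=2):
    # Target: the projections of the data onto the leading principal
    # components, which the selected raw features should reproduce.
    Z = PCA(n_components=n_components).fit_transform(X)
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_selected):
        best_err, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            # Least-squares fit: how well do these raw columns reproduce Z?
            W = np.linalg.lstsq(X[:, cols], Z)[0]
            err = np.linalg.norm(Z - X[:, cols].dot(W))
            if err < best_err:
                best_err, best_j = err, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

At prediction time you only need to compute the selected columns, which is
where the speed-up for near-real-time use comes from.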
2013/10/17 <scikit-learn-general-requ...@lists.sourceforge.net>
>
> Today's Topics:
>
> 1. Contributing code (Carlos Aspillaga)
> 2. Re: Contributing code (Jacob Vanderplas)
> 3. Re: Contributing code (Olivier Grisel)
> 4. Re: How to approach "Sum of True and False Positives = 0"
> (Josh Wasserstein)
> 5. Re: How to approach "Sum of True and False Positives = 0"
> (Olivier Grisel)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 16 Oct 2013 18:09:05 -0300
> From: Carlos Aspillaga <caspill...@gmail.com>
> Subject: [Scikit-learn-general] Contributing code
> To: scikit-learn-general@lists.sourceforge.net
>
> Hello guys,
>
> I've been using scikit-learn for a while and I would like to contribute
> some new functionality. I have it implemented in other languages, but
> not in Python (yet...).
> To start, I would love to implement some classic feature selection
> algorithms (dimensionality reduction) that I have used in other languages
> and have been missing in Python.
> What do I have to do to start contributing?
>
> Waiting for your response,
>
> Carlos Aspillaga
>
> ------------------------------
>
> Message: 2
> Date: Thu, 17 Oct 2013 06:00:11 -0700
> From: Jacob Vanderplas <jake...@cs.washington.edu>
> Subject: Re: [Scikit-learn-general] Contributing code
> To: scikit-learn-general@lists.sourceforge.net
>
> Hi Carlos,
> Welcome! We'd love to have you contribute. You can start by reading
> through the developers guide on our website, and following the suggestions
> there: http://scikit-learn.org/stable/developers/
> Feel free to ask here if any questions come up,
> Jake
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 17 Oct 2013 15:03:09 +0200
> From: Olivier Grisel <olivier.gri...@ensta.org>
> Subject: Re: [Scikit-learn-general] Contributing code
> To: scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
>
>
> Which classic feature selection algorithms are you missing in particular?
>
> --
> Olivier
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 17 Oct 2013 09:42:53 -0400
> From: Josh Wasserstein <ribonucle...@gmail.com>
> Subject: Re: [Scikit-learn-general] How to approach "Sum of True and
> False Positives = 0"
> To: "scikit-learn-general@lists.sourceforge.net"
> <scikit-learn-general@lists.sourceforge.net>
>
> Hi Joel and others,
>
> Sorry, but I am still confused. If I am using stratified shuffle
> splitting, shouldn't I always have some positives in the testing set (I
> have positives in the full dataset)? The message says: "The sum of true
> positives and false positives (in other words, the total # of positives
> in the testing fold) are equal to zero for some labels"
>
> Thanks,
>
> Josh
>
>
>
>
> On Sat, Aug 24, 2013 at 7:13 PM, Josh Wasserstein
> <ribonucle...@gmail.com> wrote:
>
> > Thanks Joel. That makes sense.
> >
> > Josh
> >
> >
> > On Sat, Aug 24, 2013 at 5:57 PM, Joel Nothman
> > <jnoth...@student.usyd.edu.au> wrote:
> >
> >> On Sun, Aug 25, 2013 at 3:28 AM, Josh Wasserstein
> >> <ribonucle...@gmail.com> wrote:
> >>
> >>> I am working on a multi-class classification problem with admittedly
> >>> very little data. My total dataset has 29 examples with the
> >>> following label distribution:
> >>>
> >>> Label A: 15 examples
> >>> Label B: 8 examples
> >>> Label C: 6 examples
> >>>
> >>> For cross validation I am using repeated stratified shuffle splits,
> >>> with K = 3 and 20 repetitions:
> >>>
> >>> from sklearn.cross_validation import StratifiedShuffleSplit
> >>> sfs = StratifiedShuffleSplit(y, n_iter=20, test_size=1.0 / K)
> >>>
> >>> The problem comes when I do an SVM grid search, e.g.:
> >>>
> >>> from sklearn.svm import SVC
> >>> from sklearn.grid_search import GridSearchCV
> >>>
> >>> clf = GridSearchCV(SVC(C=1, cache_size=5000, probability=True),
> >>>                    tuned_parameters,
> >>>                    scoring=score_func,
> >>>                    verbose=1, n_jobs=1, cv=sfs)
> >>> clf.fit(X, y)
> >>>
> >>> where score_func is usually one of:
> >>> f1_micro
> >>> f1_macro
> >>> f1_weighted
> >>>
> >>> I get warning messages like the following:
> >>>
> >>> > /path/to/python2.7/site-packages/sklearn/metrics/metrics.py:1249:
> >>> > UserWarning: The sum of true positives and false positives are equal
> >>> > to zero for some labels. Precision is ill defined for those labels
> >>> > [0]. The precision and recall are equal to zero for some labels.
> >>> > fbeta_score is ill defined for those labels [0 2].
> >>> > average=average)
> >>>
> >>> My questions are:
> >>>
> >>> 1. Why does this happen? I thought that F1 scoring would choose an
> >>> operating point (i.e. a score threshold) where we get at least some
> >>> positives (regardless of whether they are FP or TP).
> >>>
> >>
> >> The threshold is chosen by the classifier, not the metric. But this is
> >> also often impossible: your classifier might return A and B ahead of C,
> >> for instance, or it might validly predict a label that isn't present in
> >> your evaluation data.
> >>
> >> The reason for the warning is that you might argue that predicting 0
> >> entries should result in a precision of 1. You can also argue that it
> >> should be 0. Similarly for the recall of a predicted label that does
> >> not appear in your test data. This decision will make a big difference
> >> to macro F1.
> >>
> >>> 2. Can I reliably trust the scores that I get when I get this warning?
> >>>
> >>
> >> Scikit-learn opts for 0 in these cases, so the result is a lower bound
> >> on the metric. But a micro-average may be more suitable/stable.
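> >>
> >> For instance, here is a tiny illustration of that choice (the labels
> >> are made up for the example):
> >>
> >> from sklearn.metrics import precision_score, f1_score
> >>
> >> y_true = [0, 1, 2, 0, 1, 2]
> >> y_pred = [0, 0, 1, 0, 0, 1]  # label 2 is never predicted
> >> # Precision for label 2 is 0/0; scikit-learn scores it as 0, which
> >> # pulls the macro average down and triggers the warning you saw.
> >> print(precision_score(y_true, y_pred, average='macro'))
> >> print(f1_score(y_true, y_pred, average='macro'))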
>
> ------------------------------
>
> Message: 5
> Date: Thu, 17 Oct 2013 16:37:05 +0200
> From: Olivier Grisel <olivier.gri...@ensta.org>
> Subject: Re: [Scikit-learn-general] How to approach "Sum of True and
> False Positives = 0"
> To: scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
>
>
> Sounds like a bug; can you try to iterate manually over the CV folds
> and check the labels you get from those train and test fold indices?
>
> Fix the random_state to get reproducible outcomes.
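>
> Something along these lines (a rough sketch reusing the
> StratifiedShuffleSplit call from your message; it assumes y is a numpy
> array of integer labels 0..2):
>
> import numpy as np
> from sklearn.cross_validation import StratifiedShuffleSplit
>
> sss = StratifiedShuffleSplit(y, n_iter=20, test_size=1.0 / 3,
>                              random_state=42)
> for train_idx, test_idx in sss:
>     # Count the examples of each label in the test fold; a zero count
>     # for any class would explain the warning.
>     print(np.bincount(y[test_idx], minlength=3))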
>
> --
> Olivier
>
>
>
> ------------------------------
> End of Scikit-learn-general Digest, Vol 45, Issue 23
> ****************************************************
>