Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

Daniel Homola Wed, 15 Apr 2015 09:16:15 -0700

Hi Andy,

So at each iteration the predictor matrix x (n by m) is copied, and each column is shuffled in the copy. This shuffled matrix is then appended next to the original (giving n by 2m) and fed into the RF to get the feature importances.

Also, at the start of the method, a vector of length m called hitReg is initialized with zeros. After the RF training, each feature's importance in x is checked against the maximum of the shuffled ones. Those that are higher are recorded by incrementing their entry in hitReg.

At each iteration the method checks which features are doing better than expected by random chance. So if we are in the 10th iteration, and feature F was better than the max of the shuffled ones 8 times, we get p = .01 with sp.stats.binom.sf(8, 10, .5). We correct for multiple testing, and if the feature is still significant, we record it as a "confirmed" or important one. Conversely, if feature F was only better once (sp.stats.binom.cdf(1, 10, .5)), we reject it and delete it from the x matrix. The method ends if all features are either rejected or confirmed, or if the number of iterations reaches the user-set max.
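
To make the bookkeeping concrete, here is a minimal sketch of the steps described above, not the actual implementation. It uses only the standard library, so the RF training itself is left out: `update_hits` simply takes a list of 2m importances from whatever model is used. The function names, the "tentative" label, the Bonferroni correction and the default alpha of 0.05 are all assumptions made for illustration; the description above only says that a multiple-testing correction is applied.

```python
import random
from math import comb


def add_shadow_features(x, rng):
    """Return an n-by-2m matrix: the original columns followed by shuffled copies."""
    columns = [list(col) for col in zip(*x)]   # m columns, each of length n
    shadows = [col[:] for col in columns]      # copy before shuffling
    for col in shadows:
        rng.shuffle(col)                       # shuffle each copied column in place
    return [list(row) for row in zip(*(columns + shadows))]


def update_hits(importances, hit_reg):
    """Add one hit for every real feature that beats the best shuffled feature."""
    m = len(hit_reg)
    shadow_max = max(importances[m:])          # last m entries belong to the shuffled copies
    for j in range(m):
        if importances[j] > shadow_max:
            hit_reg[j] += 1


def binom_sf(k, n, p):
    """P(X > k) for X ~ Binomial(n, p), the quantity scipy.stats.binom.sf returns."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), the quantity scipy.stats.binom.cdf returns."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def decide(hits, n_iter, n_tests, alpha=0.05):
    """Label one feature after n_iter iterations.

    Assumption: Bonferroni correction over n_tests features; the labels are
    hypothetical names chosen for this sketch.
    """
    threshold = alpha / n_tests
    if binom_sf(hits, n_iter, 0.5) < threshold:
        return "confirmed"
    if binom_cdf(hits, n_iter, 0.5) < threshold:
        return "rejected"
    return "tentative"
```

For the 10-iteration example, binom_sf(8, 10, 0.5) and binom_cdf(1, 10, 0.5) both equal 11/1024, which is about 0.0107. Note that whether a feature survives the correction depends on how many features are being tested at once, so the single-feature numbers are only the starting point.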


Cheers,
Dan



On 15/04/15 16:56, Andreas Mueller wrote:

Hi Dan.
I saw that paper, but it is not well-cited.
My question is more how different this is from what we already have.
So it looks like some (5) random control features are added and the feature importances are compared against the control.
The question is whether the feature importance that is used is different from ours. Gilles?
If not, this could be hard to add. If it is the same, I think a meta-estimator would be a nice addition to the feature selection module.
Cheers,
Andy


On 04/15/2015 11:32 AM, Daniel Homola wrote:
Hi Andy,
This is the paper: http://www.jstatsoft.org/v36/i11/ which was cited 79 times according to Google Scholar.
Regarding your second point, the first 3 questions of the FAQ on the Boruta website answer it, I guess: https://m2.icm.edu.pl/boruta/
 1. *So, what's so special about Boruta?* It is an all relevant
    feature selection method, while most other are minimal optimal;
    this means it tries to find all features carrying information
    usable for prediction, rather than finding a possibly compact
    subset of features on which some classifier has a minimal error.
    Here is a paper with the details.
 2. *Why should I care?* For a start, when you try to understand the
    phenomenon that made your data, you should care about all factors
    that contribute to it, not just the bluntest signs of it in
    context of your methodology (yes, minimal optimal set of features
    by definition depends on your classifier choice).
 3. *But I only care about good classification accuracy!* So you also
    care about having a robust model; in p≫n problems, one can
    usually cherry-pick a nonsense subset of features which yields
    good or even perfect classification – minimal optimal methods can
    easily get deceived by that, leaving you with an overfitted model
    and no sign that something is wrong. See this or that for an example.
I'm not an ML expert by any means but it seemed reasonable to me. Any thoughts?
Cheers,
Dan




On 15/04/15 16:23, Andreas Mueller wrote:
Hi Daniel.
That sounds potentially interesting.
Is there a widely cited paper for this?
I didn't read the paper, but it looks very similar to RFE(RandomForestClassifier()). Is it qualitatively different from that? Does it use a different feature importance?
btw: your mail is flagged as spam as your link is broken and links to some Imperial College internal page.
Cheers,
Andy

On 04/15/2015 05:03 AM, Daniel Homola wrote:
Hi all,
I needed a multivariate feature selection method for my work. As I'm working with biological/medical data, where n < p or even n << p, I started to read up on Random Forest based methods, as in my limited understanding RF copes pretty well with this suboptimal situation.
I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/
After reading the paper and checking some of the pretty impressive citations I thought I'd try it, but it was really slow. So I thought I'd reimplement it in Python, because I hoped (based on this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) that it would be faster. And it is :) I mean a LOT faster.
I was wondering if this would be something that you would consider incorporating into the feature selection module of scikit-learn?
If yes, do you have a tutorial or some sort of guidance about how I should prepare the code, what conventions I should follow, etc.?
Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London


------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
