Hi Andy,
So at each iteration the x predictor matrix (n by m) is copied, and
each column of the copy is shuffled. This shuffled matrix is then
appended column-wise to the original (giving n by 2m) and fed into the
RF to get the feature importances.
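For illustration, a minimal sketch of this shadow-matrix step, assuming
NumPy. The function name add_shadow_features and the toy data are made
up for the example and are not taken from the actual implementation:

```python
import numpy as np

def add_shadow_features(x, rng):
    """Return an (n, 2m) matrix: the original columns plus shuffled copies."""
    x_shadow = x.copy()
    # Permute every column separately, so each shuffled feature keeps its
    # own distribution but loses any relationship to the rows' labels.
    for j in range(x_shadow.shape[1]):
        x_shadow[:, j] = rng.permutation(x_shadow[:, j])
    # Original features first, shuffled ("shadow") features second.
    return np.hstack([x, x_shadow])

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))        # toy data: n = 20, m = 3
x_ext = add_shadow_features(x, rng)  # shape (20, 6)
```

The first m columns of x_ext are untouched, so importances for the real
features and their shuffled counterparts come from the same RF fit.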
Also, at the start of the method, a vector of length m called hitReg is
initialized with zeros.
After the RF training, each feature's importance in x is checked against
the maximum of the shuffled ones. Those that are higher are recorded by
incrementing their entry in hitReg.
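A small sketch of that bookkeeping, again assuming NumPy. The helper
update_hits and the importance values are hypothetical; in scikit-learn
the importances would typically come from the feature_importances_
attribute of a forest fitted on the n by 2m matrix:

```python
import numpy as np

def update_hits(hit_reg, importances, m):
    """Add one hit to every real feature that beats the best shuffled one."""
    real_imp = importances[:m]        # first m entries: real features
    shadow_max = importances[m:].max()  # last m entries: shuffled copies
    hit_reg[real_imp > shadow_max] += 1
    return hit_reg

m = 3
hit_reg = np.zeros(m, dtype=int)    # the hitReg vector, all zeros at start
# Made-up importances for one iteration (3 real + 3 shuffled features).
importances = np.array([0.30, 0.05, 0.20, 0.10, 0.15, 0.20])
hit_reg = update_hits(hit_reg, importances, m)
```

Only the first feature is strictly above the shuffled maximum (0.20), so
it is the only one whose count goes up in this iteration.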
At each iteration the method checks which features are doing better
than expected by random chance. So if we are in the 10th iteration, and
feature F was better than the max of the shuffled ones 8 times, we get
p = .01 with sp.stats.binom.sf(8, 10, .5). We correct for multiple
testing, and if the feature is still significant, we record it as a
"confirmed" or important one. Conversely, if feature F was only better
once (sp.stats.binom.cdf(1, 10, .5)), we reject it and delete it from
the x matrix. The method ends when all features are either rejected or
confirmed, or when the number of iterations reaches the user-set
maximum.
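The decision rule can be sketched with the standard library alone.
Everything named here is an assumption for illustration: binom_sf and
binom_cdf are hypothetical helpers that follow the same tail conventions
as scipy.stats.binom.sf and scipy.stats.binom.cdf, and the Bonferroni
correction, alpha = 0.05 and the "undecided" label are my own choices,
since the description above only says that we correct for multiple
testing:

```python
from math import comb

def binom_sf(k, n, p=0.5):
    """P(X > k) for X ~ Binomial(n, p), the convention of binom.sf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p), the convention of binom.cdf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def decide(hits, n_iter, n_features, alpha=0.05):
    """Classify one feature from its hit count after n_iter iterations."""
    # Assumption: Bonferroni correction over the number of features.
    threshold = alpha / n_features
    # These calls mirror the ones quoted in the email. Note that sf(hits)
    # is the probability of strictly more than `hits` successes; using
    # sf(hits - 1) instead would include the observed count itself.
    if binom_sf(hits, n_iter) < threshold:
        return "confirmed"
    if binom_cdf(hits, n_iter) < threshold:
        return "rejected"
    return "undecided"

p_upper = binom_sf(8, 10)   # 11/1024, about 0.0107
p_lower = binom_cdf(1, 10)  # 11/1024, about 0.0107
```

With 10 iterations and 3 features, decide(8, 10, 3) gives "confirmed",
decide(1, 10, 3) gives "rejected", and decide(5, 10, 3) stays
"undecided" until more iterations have been run.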
Cheers,
Dan
On 15/04/15 16:56, Andreas Mueller wrote:
Hi Dan.
I saw that paper, but it is not well-cited.
My question is more how different this is from what we already have.
So it looks like some (5) random control features are added and the
feature importances are compared against the control.
The question is whether the feature importance that is used is
different from ours. Gilles?
If not, this could be hard to add. If it is the same, I think a
meta-estimator would be a nice addition to the feature selection module.
Cheers,
Andy
On 04/15/2015 11:32 AM, Daniel Homola wrote:
Hi Andy,
This is the paper: http://www.jstatsoft.org/v36/i11/ which was cited
79 times according to Google Scholar.
Regarding your second point, the first 3 questions of the FAQ on the
Boruta website answer it, I guess: https://m2.icm.edu.pl/boruta/
1. *So, what's so special about Boruta?* It is an all relevant
feature selection method, while most other are minimal optimal;
this means it tries to find all features carrying information
usable for prediction, rather than finding a possibly compact
subset of features on which some classifier has a minimal error.
Here is a paper with the details.
2. *Why should I care?* For a start, when you try to understand the
phenomenon that made your data, you should care about all factors
that contribute to it, not just the bluntest signs of it in
context of your methodology (yes, minimal optimal set of features
by definition depends on your classifier choice).
3. *But I only care about good classification accuracy!* So you also
care about having a robust model; in p≫n problems, one can
usually cherry-pick a nonsense subset of features which yields
good or even perfect classification – minimal optimal methods can
easily get deceived by that, leaving you with an overfitted model
and no sign that something is wrong. See this or that for an example.
I'm not an ML expert by any means but it seemed reasonable to me. Any
thoughts?
Cheers,
Dan
On 15/04/15 16:23, Andreas Mueller wrote:
Hi Daniel.
That sounds potentially interesting.
Is there a widely cited paper for this?
I didn't read the paper, but it looks very similar to
RFE(RandomForestClassifier()).
Is it qualitatively different from that? Does it use a different
feature importance?
btw: your mail is flagged as spam as your link is broken and links
to some Imperial College internal page.
Cheers,
Andy
On 04/15/2015 05:03 AM, Daniel Homola wrote:
Hi all,
I needed a multivariate feature selection method for my work. As
I'm working with biological/medical data, where n < p or even n <<
p, I started to read up on Random Forest-based methods, as in my
limited understanding RF copes pretty well with this suboptimal
situation.
I came across an R package called
Boruta: https://m2.icm.edu.pl/boruta/
After reading the paper and checking some of the pretty impressive
citations I thought I'd try it, but it was really slow. So I
thought I'd reimplement it in Python, because I hoped (based on
this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn)
that it would be faster. And it is :) I mean a LOT faster.
I was wondering if this would be something that you would consider
incorporating into the feature selection module of scikit-learn?
If yes, do you have a tutorial or some sort of guidance about how
I should prepare the code, what conventions I should follow, etc.?
Cheers,
Daniel Homola
STRATiGRAD PhD Programme
Imperial College London
------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general