I have a follow-up question regarding the use of sample_weight when
fitting the RandomForestClassifier. Does the predict_proba method take the
sample weights (used during fitting) into account as well? I spent some
time trying to understand the _tree.pyx and tree.py files in the codebase,
but I am still a little fuzzy about how the predict_proba code works when
sample weights are present.

I have an unbalanced data set (1:12 ratio) and I find that the predicted
probabilities are still highly skewed towards the majority class, even
after using sample weights.
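
To make the question concrete, here is a minimal pure-Python sketch of how sample weights could enter a leaf's class probabilities. It assumes that each leaf stores the weighted class totals of the training samples that reached it and that predict_proba normalizes those totals; that is an assumption about the tree code, not something confirmed in this thread. The helpers `balancing_weights` and `leaf_proba` are hypothetical names for illustration, not scikit-learn functions.

```python
from collections import Counter


def balancing_weights(y):
    """Hypothetical helper: give every class the same total weight."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in y]


def leaf_proba(labels, weights):
    """Weighted class fractions for the training samples in one leaf."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    norm = sum(totals.values())
    return {label: total / norm for label, total in totals.items()}


# Toy training set with a 1:12 class ratio (1 = minority, 0 = majority).
y_train = [1] * 10 + [0] * 120
w_train = balancing_weights(y_train)

# Suppose one leaf received 2 minority and 6 majority training samples.
leaf_labels = [1, 1] + [0] * 6
leaf_weights = [w_train[0]] * 2 + [w_train[-1]] * 6

unweighted = leaf_proba(leaf_labels, [1.0] * len(leaf_labels))
weighted = leaf_proba(leaf_labels, leaf_weights)
print(unweighted[1], weighted[1])  # minority-class probability in this leaf
```

In this toy leaf the minority-class probability is 0.25 without weights and about 0.8 with balancing weights. If the forest behaves as assumed, weights used during fitting would shift predict_proba in the same direction, so a strong remaining skew would have to come from somewhere else, for example from how the leaves are populated.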

I am planning to use isotonic regression to calibrate my predictions, but it
would be nice to have a less skewed input into the calibration algorithm.
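
For reference, the calibration step mentioned above can be sketched with the pool-adjacent-violators algorithm, which is the standard way to compute an isotonic (non-decreasing) fit. This is a minimal pure-Python illustration, not scikit-learn's implementation (the library ships its own `IsotonicRegression` in `sklearn.isotonic`); the function name `pav` and the toy labels are assumptions made for the example. The input is the list of true labels ordered by increasing predicted score.

```python
def pav(y, w=None):
    """Pool adjacent violators: non-decreasing fit to y.

    y must already be ordered by increasing classifier score.
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand the blocks back to one fitted value per input point.
    fitted = []
    for mean, _, n in blocks:
        fitted.extend([mean] * n)
    return fitted


# True labels of six held-out samples, sorted by increasing predicted score.
labels = [0, 0, 1, 0, 1, 1]
calibrated = pav(labels)
print(calibrated)
```

Each fitted value is the calibrated probability for the sample at that rank. The optional `w` argument lets the same sample weights used for fitting carry over to calibration, which may matter with a 1:12 class ratio.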


On Thu, Feb 7, 2013 at 11:33 PM, Gilles Louppe <g.lou...@gmail.com> wrote:

> Hello,
>
> You might achieve what you want by using sample weights when fitting
> your forest (See the 'sample_weight' parameter). There is also a
> 'balance_weights' method from the preprocessing module that basically
> generates sample weights for you, such that classes become balanced.
>
>
> https://github.com/glouppe/scikit-learn/blob/master/sklearn/preprocessing.py#L1221
>
> (This should appear in the reference, I'll fix that)
>
> Hope this helps,
>
> Gilles
>
> On 8 February 2013 00:44, Manish Amde <manish...@gmail.com> wrote:
> > Fellow sklearners,
> >
> > I am working on a classification problem with an unbalanced data set and
> > have been successful using SVM classifiers with the class_weight option.
> >
> > I have also tried Random Forests and am getting a decent ROC performance but
> > I am hoping to get a performance improvement by using Weighted or Balanced
> > Random Forests as suggested in this paper.
> > http://www.stat.berkeley.edu/tech-reports/666.pdf
> >
> > I don't see any implementation of these options but I might be mistaken so
> > I wanted to ask the community. Also, I am willing to write code and
> > contribute back if this will be useful to other folks.
> >
> > I have also thought about balancing the data by up/down sampling the
> > minority/majority class (with or without replacement) and even SMOTE, but
> > couldn't find implementations of those in the scikit-learn library yet. The
> > modified Random Forests seem to outperform these methods according to the
> > paper, hence I am interested in trying those first.
> >
> > -Manish
> >
> >
> ------------------------------------------------------------------------------
> > Free Next-Gen Firewall Hardware Offer
> > Buy your Sophos next-gen firewall before the end March 2013
> > and get the hardware for free! Learn more.
> > http://p.sf.net/sfu/sophos-d2d-feb
> > _______________________________________________
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
>
