2013/4/24 Olivier Grisel <olivier.gri...@ensta.org>

> 2013/4/24 Peter Prettenhofer <peter.prettenho...@gmail.com>:
> > I totally agree with Brian - although I'd suggest you drop option 3)
> > because it will be a lot of work.
> >
> > I'd suggest you rather do a) feature extraction or b) feature
> > selection.
> >
> > Personally, I think decision trees in general and random forests in
> > particular are not a good fit for sparse datasets - if the average
> > number of non-zero values per feature is low, then your partitions
> > will be relatively small - any subsequent splits will make the
> > partitions even smaller, so you cannot grow your trees deep since
> > you will run out of samples. This means that your tree in fact uses
> > just a tiny fraction of the available features (compared to a deep
> > tree) - unless you have a few pretty strong features or you train
> > lots of trees, this won't work out. This is probably also the reason
> > why most of the decision tree work in natural language processing is
> > done using boosted decision trees of depth one. If your features are
> > boolean, then such a model is in fact pretty similar to a simple
> > logistic regression model.
> >
> > I have the impression that Random Forest in particular is a poor
> > "evidence accumulator" (pooling evidence from lots of weak features)
> > - linear models and boosted trees are much better here.
>
> Very interesting consideration. Any reference paper to study this in
> more detail (both theory and empirical validation)?
>
>

Actually, no - it's just a gut feeling based on how decision trees / RFs
work (hard, non-intersecting partitions) - I will try to dig something up.
I'd definitely like to hear any criticism/remarks on my view, though.
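
To make the partition argument concrete, here is a toy back-of-the-envelope
sketch (the numbers are made up, just to illustrate the intuition): if you
follow the all-non-zero branch of successive splits, the sample count decays
geometrically with the per-feature density, so you run out of samples after
only a couple of levels:

    # toy illustration - n_samples and density are assumed, not measured
    n_samples = 100000    # dataset size
    density = 0.01        # average fraction of non-zero values per feature

    remaining = n_samples
    for depth in range(1, 6):
        # only the ~1% of samples that are non-zero follow this branch
        remaining = int(remaining * density)
        print("depth %d: ~%d samples left on the non-zero branch"
              % (depth, remaining))
    # depth 1: ~1000, depth 2: ~10, depth 3: ~0, ...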


>
> Also, do you have a good paper that demonstrates state-of-the-art
> results with boosted stumps for NLP?
>

I haven't seen boosted stumps used in NLP for a while - but maybe I
haven't paid close attention. What comes to mind is some work by Xavier
Carreras on NER for CoNLL 2002 (see [1] for an overview of the shared task
- actually, a number of participants used boosting/trees). Joseph Turian
also used boosting in his thesis on parsing [2].

[1] http://acl.ldc.upenn.edu/W/W02/W02-2024.pdf
[2] http://cs.nyu.edu/web/Research/Theses/turian_joseph.pdf
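
And for the "boosted stumps on boolean features are pretty similar to a
simple logistic regression model" remark above, here is a minimal synthetic
comparison - the data, sizes, and hyperparameters are all made-up
assumptions, using nothing but the plain scikit-learn estimators:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier, \
        RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    # many weak boolean features whose evidence must be pooled
    X = (rng.rand(2000, 50) < 0.1).astype(np.float64)
    w = rng.randn(50)
    y = (X.dot(w) > 0).astype(int)  # linear ground truth

    X_tr, y_tr = X[:1500], y[:1500]
    X_te, y_te = X[1500:], y[1500:]

    models = {
        "boosted stumps": GradientBoostingClassifier(
            max_depth=1, n_estimators=200, random_state=0),
        "logistic regression": LogisticRegression(),
        "random forest": RandomForestClassifier(
            n_estimators=100, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, model.score(X_te, y_te))

On data like this I'd expect the stumps and the linear model to end up
close, with the forest trailing - which is the "evidence accumulator"
point - though of course this is a toy setup, not a benchmark.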



> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel



-- 
Peter Prettenhofer