2013/4/24 Peter Prettenhofer <peter.prettenho...@gmail.com>:
> I totally agree with Brian - although I'd suggest you drop option 3) because
> it will be a lot of work.
>
> I'd suggest you rather should do a) feature extraction or b) feature
> selection.
>
> Personally, I think decision trees in general and random forest in
> particular are not a good fit for sparse datasets - if the average number of
> non-zero values per feature is low, then your partitions will be
> relatively small - any subsequent splits will make the partitions even
> smaller thus you cannot grow your trees deep since you will run out of
> samples. This means that your tree in fact uses just a tiny fraction of the
> available features (compared to a deep tree) - unless you have a few pretty
> strong features or you train lots of trees this won't work out. This is
> probably also the reason why most of the decision tree work in natural
> language processing is done using boosted decision trees of depth one. If
> your features are boolean, then such a model is in fact pretty similar to a
> simple logistic regression model.
>
> I have the impression that Random Forest in particular is a poor "evidence
> accumulator" (pooling evidence from lots of weak features) - linear models
> and boosted trees are much better here.

Very interesting consideration. Is there a reference paper to study this
in more detail (both theory and empirical validation)?

Also, do you have a good paper that demonstrates state-of-the-art results
with boosted stumps for NLP?
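As a quick sanity check of Peter's intuition, here is a minimal sketch comparing boosted depth-one stumps against logistic regression on a sparse boolean dataset. The synthetic data and all parameter choices below are my own illustrative assumptions, not from any of the papers asked about:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic sparse boolean dataset (illustrative assumption):
# many weak binary features, label driven by pooled evidence.
rng = np.random.RandomState(0)
n_samples, n_features = 2000, 200
X = (rng.rand(n_samples, n_features) < 0.05).astype(np.float64)  # ~5% non-zeros
w = rng.randn(n_features)  # many weak feature weights
y = (X.dot(w) + 0.1 * rng.randn(n_samples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Boosted decision stumps: max_depth=1, as in the NLP setting Peter mentions.
stumps = GradientBoostingClassifier(max_depth=1, n_estimators=200,
                                    random_state=0)
stumps.fit(X_train, y_train)

# Linear model baseline for comparison.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

print("boosted stumps:", stumps.score(X_test, y_test))
print("logistic regression:", logreg.score(X_test, y_test))
```

On data like this, where the label is a pooled sum of many weak boolean features, both models should behave similarly, which is the "evidence accumulator" point above.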

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
