2013/4/24 Peter Prettenhofer <peter.prettenho...@gmail.com>:
> I totally agree with Brian, although I'd suggest you drop option 3) because
> it will be a lot of work.
>
> I'd suggest you rather do a) feature extraction or b) feature
> selection.
>
> Personally, I think decision trees in general and random forests in
> particular are not a good fit for sparse datasets. If the average number of
> non-zero values for each feature is low, then your partitions will be
> relatively small, and any subsequent splits will make the partitions even
> smaller, so you cannot grow your trees deep because you will run out of
> samples. This means that your tree in fact uses just a tiny fraction of the
> available features (compared to a deep tree); unless you have a few pretty
> strong features or you train lots of trees, this won't work out. This is
> probably also the reason why most of the decision tree work in natural
> language processing is done using boosted decision trees of depth one. If
> your features are boolean, then such a model is in fact pretty similar to a
> simple logistic regression model.
>
> I've the impression that Random Forest in particular is a poor "evidence
> accumulator" (pooling evidence from lots of weak features); linear models
> and boosted trees are much better here.
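
A minimal sketch of why boosted depth-one trees on boolean features look like a linear model: each stump contributes one value when its feature is 0 and another when it is 1, so the sum over stumps collapses to an intercept plus one weight per feature. The stump values below are random placeholders rather than fitted ones, and the names `stumps`, `ensemble_score` and `linear_score` are hypothetical, chosen only for this illustration.

```python
import itertools
import random

# Each stump is (feature_index, value_if_0, value_if_1): a depth-one tree on a
# boolean feature. The values are random placeholders, not fitted ones.
random.seed(0)
n_features = 4
stumps = [
    (random.randrange(n_features), random.uniform(-1, 1), random.uniform(-1, 1))
    for _ in range(20)
]


def ensemble_score(x):
    """Sum of stump outputs, as an additive boosted ensemble computes it."""
    return sum(v1 if x[j] else v0 for j, v0, v1 in stumps)


# Collapse the ensemble: for x[j] in {0, 1} a stump equals v0 + (v1 - v0) * x[j],
# so the whole sum is an intercept plus one weight per feature.
intercept = sum(v0 for _, v0, _ in stumps)
weights = [0.0] * n_features
for j, v0, v1 in stumps:
    weights[j] += v1 - v0


def linear_score(x):
    """Equivalent linear form: intercept + dot(weights, x)."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))


# Largest disagreement between the two forms over every boolean input.
max_gap = max(
    abs(ensemble_score(x) - linear_score(x))
    for x in itertools.product([0, 1], repeat=n_features)
)
```

Here `max_gap` is zero up to floating-point rounding. If the ensemble score is read as a log-odds and passed through a sigmoid, the model has the same functional form as logistic regression; the learned weights will generally differ, since the two are trained differently.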
Very interesting consideration. Is there a reference paper that studies this in more detail (both theory and empirical validation)?

Also, do you have a good paper that demonstrates state-of-the-art results with boosted stumps for NLP?

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general