Thanks, guys.
Perhaps I should explain what I am trying to do and then open it up for
suggestions.
I have 203k training examples, each with 457k features. The features are
composed of one-hot encoded categorical values as well as stemmed, TF-IDF
weighted unigrams and bigrams (NLP). As you can probably guess, the
overwhelming majority of the features are the unigrams and bigrams.
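For context, the feature matrix is built roughly along these lines (a simplified
sketch; the variable names, toy data, and parameters below are placeholders, not
my actual pipeline):

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# toy placeholder data -- in reality ~203k rows
stemmed_docs = ["cheap flight to london", "cheap hotel in london"]
cat_values = [["electronics"], ["books"]]

# TF-IDF over stemmed text, unigrams + bigrams (the bulk of the 457k columns)
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_text = tfidf.fit_transform(stemmed_docs)

# one-hot encode the categorical values (string input needs sklearn >= 0.20)
ohe = OneHotEncoder(handle_unknown="ignore")
X_cat = ohe.fit_transform(cat_values)

# stack everything into one large sparse design matrix
X = hstack([X_cat, X_text]).tocsr()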
In the end, I am looking to build a regression model. I have tried a grid
search on SGDRegressor, but have not had any promising results (~0.00 or
even negative R^2 values).
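For reference, the search I ran looks roughly like this (a minimal sketch with
illustrative parameter values and synthetic data standing in for my actual
matrix, not my exact grid):

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases

# synthetic sparse stand-in for the real 203k x 457k matrix
rng = np.random.RandomState(0)
X = sparse_random(1000, 5000, density=0.001, format="csr", random_state=rng)
y = rng.randn(1000)

# illustrative grid, not the exact one I searched
param_grid = {
    "alpha": [1e-6, 1e-5, 1e-4, 1e-3],
    "penalty": ["l2", "elasticnet"],
    "loss": ["squared_error", "huber"],  # "squared_loss" in older sklearn
}

search = GridSearchCV(
    SGDRegressor(max_iter=50, tol=1e-4),
    param_grid,
    scoring="r2",
    cv=3,
    n_jobs=-1,  # use all available cores
)
search.fit(X, y)
print(search.best_params_, search.best_score_)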
I would appreciate ideas/suggestions.
Thanks
PS: if it matters, I have 8 cores and 52 GB of RAM at my disposal.
On Wed, Apr 24, 2013 at 5:32 AM, Peter Prettenhofer <
peter.prettenho...@gmail.com> wrote:
>
>
>
> 2013/4/24 Olivier Grisel <olivier.gri...@ensta.org>
>
>> 2013/4/24 Peter Prettenhofer <peter.prettenho...@gmail.com>:
>> > I totally agree with Brian - although I'd suggest you drop option 3)
>> > because it will be a lot of work.
>> >
>> > I'd suggest you rather do a) feature extraction or b) feature selection.
>> >
>> > Personally, I think decision trees in general, and random forests in
>> > particular, are not a good fit for sparse datasets - if the average
>> > number of non-zero values for each feature is low, then your partitions
>> > will be relatively small, and any subsequent splits will make the
>> > partitions even smaller, so you cannot grow your trees deep before you
>> > run out of samples. This means that your tree in fact uses just a tiny
>> > fraction of the available features (compared to a deep tree) - unless
>> > you have a few pretty strong features, or you train lots of trees, this
>> > won't work out. This is probably also why most of the decision tree work
>> > in natural language processing is done with boosted decision trees of
>> > depth one. If your features are boolean, then such a model is in fact
>> > pretty similar to a simple logistic regression model.
>> >
>> > I have the impression that Random Forests in particular are poor
>> > "evidence accumulators" (pooling evidence from lots of weak features) -
>> > linear models and boosted trees are much better here.
>>
>> Very interesting point. Is there a reference paper to study this in
>> more detail (both theory and empirical validation)?
>>
>
> Actually, no - just a gut feeling based on how decision trees / RFs work
> (hard, non-intersecting partitions) - I will try to dig something up -
> I'd definitely like to hear any criticism/remarks on my view, though.
>
>
>>
>> Also, do you have a good paper that demonstrates state-of-the-art results
>> with boosted stumps for NLP?
>>
>
> I haven't seen any use of boosted stumps in NLP for a while - but maybe I
> haven't been paying close attention - what comes to mind is some work by
> Xavier Carreras on NER for CoNLL 2002 (see [1] for an overview of the
> shared task - actually, a number of participants used boosting/trees).
> Joseph Turian used boosting in his thesis on parsing [2].
>
> [1] http://acl.ldc.upenn.edu/W/W02/W02-2024.pdf
> [2] http://cs.nyu.edu/web/Research/Theses/turian_joseph.pdf
>
>
>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>>
>
>
>
> --
> Peter Prettenhofer
>
>