I totally agree with Brian - although I'd suggest you drop option 3)
because it will be a lot of work.
Instead, I'd suggest you do either a) feature extraction or b) feature
selection.
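
To make a) and b) a bit more concrete, here is a minimal sketch, assuming
scikit-learn's TruncatedSVD (extraction) and SelectKBest with f_regression
(selection). The toy data, the 50 components and k=50 are placeholders I made
up for illustration, not tuned values; both steps take the sparse matrix
directly and hand a small dense matrix to the forest.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)

# Toy stand-in for the real data: 500 samples, 2000 sparse features.
X = sparse.random(500, 2000, density=0.01, format="csr", random_state=rng)
# Synthetic target that depends on the first five columns plus a little noise.
y = np.asarray(X[:, :5].sum(axis=1)).ravel() + 0.01 * rng.randn(500)

# a) feature extraction: project the sparse matrix onto 50 dense components.
X_svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# b) feature selection: keep the 50 columns with the highest univariate F-score.
X_sel = SelectKBest(f_regression, k=50).fit_transform(X, y).toarray()

# Either reduced matrix is small enough to be dense, so the forest can be fit.
for name, X_small in [("extraction", X_svd), ("selection", X_sel)]:
    forest = RandomForestRegressor(n_estimators=20, random_state=0)
    forest.fit(X_small, y)
    print(name, X_small.shape)
```

With 400k features you would of course pick the number of components or k by
cross-validation rather than hard-coding it.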
Personally, I think decision trees in general and random forests in
particular are not a good fit for sparse datasets. If the average number
of non-zero values per feature is low, then your partitions will be
relatively small, and any subsequent split makes them smaller still, so
you cannot grow your trees deep because you run out of samples. This
means that each tree in fact uses just a tiny fraction of the available
features (compared to a deep tree) - unless you have a few pretty strong
features or you train lots of trees, this won't work out. This is
probably also the reason why most of the decision tree work in natural
language processing is done using boosted decision trees of depth one. If
your features are boolean, then such a model is in fact pretty similar to a
simple logistic regression model.
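
Two small illustrations of the paragraph above, in plain Python. The first is
back-of-the-envelope arithmetic for how fast samples run out when you keep
following the "feature is non-zero" branch; it assumes independent features
and a 1% non-zero rate, both of which are my simplifications. The second
checks that a sum of depth-one trees on boolean features can be rewritten as
a bias plus one weight per feature, i.e. a linear score. The helper names and
all the numbers are hypothetical.

```python
from itertools import product


def expected_partition_sizes(n_samples, nonzero_fraction, depth):
    """Expected samples left after taking the 'non-zero' branch d times,
    for d = 0..depth, assuming independent features (a simplification)."""
    return [n_samples * nonzero_fraction ** d for d in range(depth + 1)]


# 200k samples as in the original question; the 1% rate is an assumed figure.
sizes = expected_partition_sizes(200_000, 0.01, 3)
print(sizes)


def stump(feature, if_one, if_zero):
    """Depth-one tree on a single boolean feature."""
    return lambda x: if_one if x[feature] else if_zero


# (feature index, output if feature is 1, output if feature is 0)
params = [(0, 0.8, -0.2), (1, -0.5, 0.1), (2, 0.3, 0.0), (0, 0.25, 0.05)]
stumps = [stump(*p) for p in params]

# Equivalent linear form: each stump contributes if_zero to the bias
# and (if_one - if_zero) to the weight of its feature.
bias = sum(if_zero for _, _, if_zero in params)
weights = [0.0, 0.0, 0.0]
for j, if_one, if_zero in params:
    weights[j] += if_one - if_zero

# The two forms agree on every possible boolean input.
for x in product([0, 1], repeat=3):
    boosted = sum(h(x) for h in stumps)
    linear = bias + sum(w * xi for w, xi in zip(weights, x))
    assert abs(boosted - linear) < 1e-12
```

The partition sizes drop from 200k to roughly 2000, 20 and then less than one
sample, which is the "running out of samples" effect. If the boosted score is
passed through a logistic link, the second part is exactly the functional form
of logistic regression.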
I have the impression that Random Forest in particular is a poor "evidence
accumulator" (i.e. it is bad at pooling evidence from lots of weak
features) - linear models and boosted trees are much better here.
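
As a rough sketch of the linear-model route, assuming scikit-learn's Ridge
(any linear model that accepts scipy sparse input would do): it fits the
sparse matrix directly, without densifying, and learns one weight per feature,
so every weak feature can contribute a little. The synthetic data and
alpha=1.0 are made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Toy sparse boolean design matrix: 1000 samples, 5000 features.
X = sparse.random(1000, 5000, density=0.005, format="csr", random_state=rng)
X.data[:] = 1.0  # turn the stored non-zero entries into boolean indicators

# Target built from many weak features: every column gets a small weight.
w_true = rng.normal(scale=0.1, size=5000)
y = X @ w_true + 0.01 * rng.randn(1000)

# The linear model is fit on the sparse matrix as-is.
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_.shape)
```

This says nothing about which model predicts better on your data; it only
shows that the sparse matrix never has to be converted to a dense one.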
best,
Peter
2013/4/24 Brian Holt <bdho...@gmail.com>
> At the moment your three options are
> 1) get more memory
> 2) do feature selection - 400k features on 200k samples seems to me to
> contain a lot of redundant information or irrelevant features
> 3) submit a PR to support sparse matrices - this is going to be a lot of
> work and I doubt it's worth it.
>
> All the best
> Brian
> On Apr 24, 2013 5:14 AM, "Calvin Morrison" <mutanttur...@gmail.com> wrote:
>
>> get more memory?
>>
>> On 23 April 2013 17:06, Alex Kopp <ark...@cornell.edu> wrote:
>> > Hi,
>> >
>> > I am looking to build a random forest regression model with a pretty large
>> > amount of sparse data. I noticed that I cannot fit the random forest model
>> > with a sparse matrix. Unfortunately, a dense matrix is too large to fit in
>> > memory. What are my options?
>> >
>> > For reference, I have just over 400k features and just over 200k training
>> > examples
>> >
>> >
>> ------------------------------------------------------------------------------
>> > Try New Relic Now & We'll Send You this Cool Shirt
>> > New Relic is the only SaaS-based application performance monitoring
>> service
>> > that delivers powerful full stack analytics. Optimize and monitor your
>> > browser, app, & servers with just a few lines of code. Try New Relic
>> > and get this awesome Nerd Life shirt!
>> http://p.sf.net/sfu/newrelic_d2d_apr
>> > _______________________________________________
>> > Scikit-learn-general mailing list
>> > Scikit-learn-general@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >
>>
>
--
Peter Prettenhofer