Re: [scikit-learn] best way to scale on the random forest for text w bag of words ...

Sasha Kacanski Thu, 16 Mar 2017 07:25:54 -0700

Thank you very much...
I will try alternatives

Sasha Kacanski


On Mar 16, 2017 8:28 AM, "Roman Yurchak" <[email protected]> wrote:

> If you run out of memory at the prediction step, splitting the test
> dataset in batches, then concatenating the results should work fine. Why
> would it "skew" the results?
>
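A minimal sketch of the batching idea above, on synthetic data rather than the dataset from this thread. The helper name predict_in_batches and the batch size are hypothetical. Once the forest is fitted, each row's prediction depends only on the model and that row, so concatenated batch predictions should match a single predict call:

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier


def predict_in_batches(model, X, batch_size=10_000):
    """Predict on consecutive row slices of X and concatenate the outputs.

    Hypothetical helper: only one slice is passed to the model at a time,
    which bounds the peak memory used at the prediction step.
    """
    parts = []
    for start in range(0, X.shape[0], batch_size):
        parts.append(model.predict(X[start:start + batch_size]))
    return np.concatenate(parts)


# Small random sparse matrix standing in for a bag-of-words matrix
# (illustrative data only).
rng = np.random.RandomState(0)
X = sparse.random(500, 50, density=0.1, format="csr", random_state=rng)
y = rng.randint(0, 2, size=500)

# Fit on a small part, predict on the rest, as in the original question.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X[:100], y[:100])

batched = predict_in_batches(model, X[100:], batch_size=64)
full = model.predict(X[100:])
```

Here batched and full should be identical arrays; batching changes only how much data the model sees at once, not the fitted trees.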
> 70 GB of RAM seems huge: for comparison, here are some categorization
> benchmarks on a 700k-document text dataset that use more on the order of
> 5-10 GB of RAM,
>     https://github.com/FreeDiscovery/FreeDiscovery/issues/58
> though with fairly short documents, for other algorithms, and with a
> smaller training set.
>
> You could also try reducing the size of your dictionary with hashing.
> If you really want to use random forest and have memory constraints, you
> might want to use n_jobs=1 to avoid memory copies,
>
> https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory
>
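A rough sketch of the two suggestions above combined: hashing to cap the number of features, then a random forest run with n_jobs=1. The documents, labels, n_features=2**12 and max_depth=5 are illustrative assumptions, not values from this thread; max_depth follows Joel's advice (quoted below) to limit tree size.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer

# Made-up documents and labels, for illustration only.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "random forests use a lot of memory on wide sparse data",
    "hashing maps tokens to a fixed number of columns",
    "the lazy dog sleeps all day",
    "memory usage grows with tree depth and feature count",
    "a fox is a small wild animal",
]
labels = [0, 1, 1, 0, 1, 0]

# HashingVectorizer is stateless: no vocabulary is stored, and the output
# always has exactly n_features columns, however many distinct tokens occur.
vectorizer = HashingVectorizer(n_features=2**12)
X = vectorizer.transform(docs)

# n_jobs=1 keeps everything in a single process; max_depth bounds tree size.
forest = RandomForestClassifier(
    n_estimators=10, max_depth=5, n_jobs=1, random_state=0
)
forest.fit(X, labels)
predicted = forest.predict(X)
```

A smaller n_features lowers memory use at the cost of more hash collisions, so it is worth checking accuracy at a few sizes.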
> But as Joel was saying, random forest might not be the best choice for huge
> sparse arrays; NaiveBayes, LogisticRegression or SVM could be better
> suited, or gradient boosting if you want to go that way...
>
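For comparison, a small sketch of the alternatives mentioned above, each fed sparse TF-IDF features through a pipeline. The documents and labels are made up, and using TfidfVectorizer, LinearSVC as the SVM, and max_iter=1000 are assumptions for illustration, not choices made in this thread:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training and test documents, for illustration only.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "win money now click here",
    "project status report attached",
    "buy cheap watches online",
    "lunch with the team on friday",
]
train_labels = [1, 0, 1, 0, 1, 0]
test_docs = ["buy pills online", "agenda for the team meeting"]

# Each candidate works directly on the sparse matrix from the vectorizer.
candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

predictions = {}
for name, pipeline in candidates.items():
    pipeline.fit(train_docs, train_labels)
    predictions[name] = pipeline.predict(test_docs)
```

On a real dataset the candidates would be compared on a held-out set instead of two hand-written test documents.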
>
> On 16/03/17 02:44, Joel Nothman wrote:
>
>> Trees are not a traditional choice for bag of words models, but you
>> should make sure you are at least using the parameters of the random
>> forest to limit the size (depth, branching) of the trees.
>>
>> On 16 March 2017 at 12:20, Sasha Kacanski <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>     Hi,
>>     As soon as the number of trees and features goes up, 70 GB of RAM is
>>     gone and I am getting out-of-memory errors.
>>     The file size is 700 MB. The dataframe quickly shrinks from 14 to 2
>>     columns, but there is a ton of text ...
>>     With 10 estimators and 100 features per word I can't tackle ~900k
>>     records ...
>>     The training set, about 15% of the data, does perfectly fine, but when
>>     the test set comes, that is it.
>>
>>     I can split stuff and multiprocess it, but I believe that will simply
>>     skew the results...
>>
>>     Any ideas?
>>
>>
>>     --
>>     Aleksandar Kacanski
>>
>>     _______________________________________________
>>     scikit-learn mailing list
>>     [email protected] <mailto:[email protected]>
>>     https://mail.python.org/mailman/listinfo/scikit-learn
>>     <https://mail.python.org/mailman/listinfo/scikit-learn>
>>
