If you run out of memory at the prediction step, splitting the test dataset into batches and then concatenating the results should work fine. Each sample is predicted independently of the others, so why would it "skew" the results?
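
A minimal sketch of what I mean by batching, assuming an already fitted scikit-learn estimator. The helper name `predict_in_batches`, the batch size and the random toy data are made up for illustration and are not from your setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def predict_in_batches(model, X, batch_size=10_000):
    """Predict on X one slice of rows at a time and concatenate the results."""
    n_samples = X.shape[0]
    parts = [
        model.predict(X[start:start + batch_size])
        for start in range(0, n_samples, batch_size)
    ]
    return np.concatenate(parts)


# Toy data standing in for the real feature matrix (an assumption).
rng = np.random.RandomState(0)
X_train = rng.rand(200, 20)
y_train = rng.randint(0, 2, size=200)
X_test = rng.rand(1000, 20)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

# Batched and one-shot predictions should agree sample for sample.
batched = predict_in_batches(clf, X_test, batch_size=128)
full = clf.predict(X_test)
```

Row slicing works the same way on a scipy CSR matrix, so this should carry over to sparse bag-of-words features.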

70 GB of RAM seems huge. For comparison, here are some categorization benchmarks on a 700k text dataset that use on the order of 5-10 GB of RAM,
    https://github.com/FreeDiscovery/FreeDiscovery/issues/58
though with fairly short documents, other algorithms, and a smaller training set.

You could also try reducing the size of your dictionary (vocabulary) with feature hashing.
If you really want to use a random forest under memory constraints, you might want to set n_jobs=1 to avoid memory copies,

https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory
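
Roughly, the two suggestions above could look like the sketch below. The documents, the labels and the particular values (`n_features=2**12`, `max_depth=20`) are placeholders I picked for illustration, not tuned recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer

# Toy corpus standing in for the real text column (an assumption).
docs = [
    "invoice overdue payment reminder",
    "payment received thank you",
    "meeting agenda for monday",
    "monday meeting moved to tuesday",
    "overdue invoice second reminder",
    "agenda and notes from the meeting",
]
labels = ["billing", "billing", "meeting", "meeting", "billing", "meeting"]

# Hashing fixes the number of columns up front, so no vocabulary
# has to be built or kept in memory.
vectorizer = HashingVectorizer(n_features=2**12)
X = vectorizer.transform(docs)  # sparse matrix, shape (n_docs, 4096)

# n_jobs=1 keeps everything in a single process; max_depth caps tree size.
clf = RandomForestClassifier(
    n_estimators=10, max_depth=20, n_jobs=1, random_state=0
)
clf.fit(X, labels)
predicted = clf.predict(X)
```

HashingVectorizer is stateless, so the same `transform` call can be applied to the test set batch by batch.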

But as Joel was saying, random forest might not be the best choice for huge sparse arrays; NaiveBayes, LogisticRegression or SVM could be better suited, or gradient boosting if you want to go that way...
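
For what it's worth, trying a couple of those alternatives on sparse text features takes only a few lines. This is only a sketch on made-up documents and labels, with default or arbitrary settings (for example `max_iter=1000`), and says nothing about accuracy on your data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the real data (an assumption).
docs = [
    "refund requested for damaged item",
    "item arrived damaged, want a refund",
    "how do I reset my password",
    "password reset link not working",
    "refund still not processed",
    "cannot log in after password change",
]
labels = ["refund", "refund", "account", "account", "refund", "account"]

predictions = {}
for estimator in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    # Both estimators accept the sparse tf-idf matrix directly.
    model = make_pipeline(TfidfVectorizer(), estimator)
    model.fit(docs, labels)
    predictions[type(estimator).__name__] = model.predict(
        ["refund for a damaged order"]
    )
```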


On 16/03/17 02:44, Joel Nothman wrote:
Trees are not a traditional choice for bag of words models, but you
should make sure you are at least using the parameters of the random
forest to limit the size (depth, branching) of the trees.

On 16 March 2017 at 12:20, Sasha Kacanski <[email protected]
<mailto:[email protected]>> wrote:

    Hi,
    As soon as the number of trees and features goes up, 70 GB of RAM is
    gone and I am getting out-of-memory errors.
    The file size is 700 MB. The dataframe quickly shrinks from 14 to 2
    columns, but there is a ton of text ...
    With 10 estimators and 100 features per word I can't tackle ~900k
    records ...
    The training set, about 15% of the data, does perfectly fine, but when
    the test set comes, that is it.

    I can split stuff and multiprocess it, but I believe that will simply
    skew results...

    Any ideas?


    --
    Aleksandar Kacanski

    _______________________________________________
    scikit-learn mailing list
    [email protected] <mailto:[email protected]>
    https://mail.python.org/mailman/listinfo/scikit-learn
    <https://mail.python.org/mailman/listinfo/scikit-learn>



