If you run out of memory at the prediction step, splitting the test
dataset into batches and then concatenating the results should work fine,
since each sample is predicted independently of the others. Why
would it "skew" the results?
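To illustrate the batching idea, here is a minimal sketch. The helper
name predict_in_batches, the batch size and the synthetic data are my own
placeholders, not something from your setup; it only assumes a fitted
scikit-learn estimator with a predict method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def predict_in_batches(model, X, batch_size=1000):
    """Call model.predict on consecutive row slices of X and concatenate."""
    parts = [
        model.predict(X[start:start + batch_size])
        for start in range(0, X.shape[0], batch_size)
    ]
    return np.concatenate(parts)


# Small synthetic data, used only to compare batched and one-shot prediction.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0, n_jobs=1)
model.fit(X[:100], y[:100])

full = model.predict(X[100:])                                 # all rows at once
batched = predict_in_batches(model, X[100:], batch_size=64)   # slice by slice
```

Row slicing like this also works on scipy CSR matrices, so the same
helper should apply to sparse bag-of-words features.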
70 GB of RAM seems huge. For comparison, here are some categorization
benchmarks on a 700k text dataset that use on the order of 5-10 GB of
RAM,
https://github.com/FreeDiscovery/FreeDiscovery/issues/58
though with fairly short documents, other algorithms, and a
smaller training set.
You could also try reducing the size of your dictionary (vocabulary) with
feature hashing, e.g. HashingVectorizer.
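A rough sketch of what hashing could look like with scikit-learn's
HashingVectorizer. The toy documents and the n_features value are
arbitrary assumptions for illustration; you would tune n_features for
your data.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy documents standing in for the real text column.
docs = [
    "random forest runs out of memory",
    "hashing keeps the feature space fixed",
    "sparse text features for classification",
]

# n_features caps the number of columns; 2**18 is an arbitrary choice here.
vectorizer = HashingVectorizer(n_features=2**18)

# The vectorizer is stateless: no vocabulary is built or kept in memory.
X = vectorizer.transform(docs)
```

The output is a scipy sparse matrix with a fixed number of columns, at
the cost of possible hash collisions and no way to map columns back to
words.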
If you really want to use random forest and have memory constraints, you
might want to use n_jobs=1 to avoid memory copies,
https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory
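Something along these lines, as a sketch only. The max_depth and
min_samples_leaf values are made-up numbers to show where the size limits
go (along the lines of what Joel suggests below), and the data is
synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)

forest = RandomForestClassifier(
    n_estimators=10,
    max_depth=10,         # cap the depth of each tree (assumed value)
    min_samples_leaf=5,   # avoid very small leaves (assumed value)
    n_jobs=1,             # single process, no parallel workers
    random_state=0,
)
forest.fit(X, y)

# Check how deep the fitted trees actually got.
deepest = max(tree.tree_.max_depth for tree in forest.estimators_)
```

Shallower trees with larger leaves have fewer nodes, so the fitted forest
itself takes less memory.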
But as Joel was saying, random forest might not be the best choice for huge
sparse arrays; NaiveBayes, LogisticRegression or SVM could be better
suited, or gradient boosting if you want to go that way...
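For instance, a linear model on hashed features could be set up roughly
like this. The toy documents, labels and n_features are invented for
illustration, and LogisticRegression is just one of the options mentioned
above.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with binary labels.
docs = [
    "cheap pills online",
    "meeting at noon",
    "buy cheap pills",
    "lunch meeting today",
]
labels = [1, 0, 1, 0]

# Hashing produces a sparse matrix; the linear model consumes it directly
# without converting it to a dense array.
clf = make_pipeline(
    HashingVectorizer(n_features=2**16),
    LogisticRegression(),
)
clf.fit(docs, labels)

pred = clf.predict(["cheap pills", "meeting today"])
```

The fitted model stores one coefficient per hashed feature, so its
memory footprint is fixed by n_features rather than by the number of
trees or their depth.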
On 16/03/17 02:44, Joel Nothman wrote:
Trees are not a traditional choice for bag of words models, but you
should make sure you are at least using the parameters of the random
forest to limit the size (depth, branching) of the trees.
On 16 March 2017 at 12:20, Sasha Kacanski wrote:
Hi,
As soon as the number of trees and features goes higher, 70 GB of RAM is
gone and I am getting out-of-memory errors.
The file size is 700 MB. The dataframe quickly shrinks from 14 to 2 columns,
but there is a ton of text ...
With 10 estimators and 100 features per word I can't tackle ~900k
records ...
The training set, about 15% of the data, does perfectly fine, but when the
test set comes, that is it.
I can split stuff and multiprocess it, but I believe that will simply
skew the results...
Any ideas?
--
Aleksandar Kacanski
_______________________________________________
scikit-learn mailing list
[email protected] <mailto:[email protected]>
https://mail.python.org/mailman/listinfo/scikit-learn
<https://mail.python.org/mailman/listinfo/scikit-learn>