Thank you very much... I will try alternatives.

Sasha Kacanski
On Mar 16, 2017 8:28 AM, "Roman Yurchak" <[email protected]> wrote:

> If you run out of memory at the prediction step, splitting the test
> dataset into batches, then concatenating the results should work fine. Why
> would it "skew" the results?
>
> 70 GB of RAM seems huge: for comparison, here are some categorization
> benchmarks on a 700k text dataset that use more on the order of 5-10 GB RAM,
> https://github.com/FreeDiscovery/FreeDiscovery/issues/58
> though with fairly short documents, for other algorithms and with a
> smaller training set.
>
> You could also try reducing the size of your dictionary with hashing.
> If you really want to use random forest and have memory constraints, you
> might want to use n_jobs=1 to avoid memory copies,
>
> https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory
>
> But as Joel was saying, random forest might not be the best choice for huge
> sparse arrays; NaiveBayes, LogisticRegression or SVM could be better
> suited, or gradient boosting if you want to go that way...
>
> On 16/03/17 02:44, Joel Nothman wrote:
>
>> Trees are not a traditional choice for bag of words models, but you
>> should make sure you are at least using the parameters of the random
>> forest to limit the size (depth, branching) of the trees.
>>
>> On 16 March 2017 at 12:20, Sasha Kacanski <[email protected]> wrote:
>>
>>     Hi,
>>     As soon as the number of trees and features goes higher, 70 GB of RAM
>>     is gone and I am getting out of memory errors.
>>     File size is 700 MB. Dataframe quickly shrinks from 14 to 2 columns
>>     but there is a ton of text ...
>>     With 10 estimators and 100 features per word I can't tackle ~900k
>>     records ...
>>     The training set, about 15% of the data, does perfectly fine, but
>>     when the test set comes, that is it.
>>
>>     I can split stuff and multiprocess it, but I believe that will simply
>>     skew results...
>>
>>     Any ideas?
>>
>>     --
>>     Aleksandar Kacanski
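
For reference, the batching, hashing and tree-size suggestions quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the setup actually used in this thread: the toy texts and labels, the helper name predict_in_batches, and the values chosen for n_features, max_depth and batch_size are all made up for the example. Since each row is predicted independently of the others, predicting in slices and concatenating should give the same output as predicting on the whole test set at once.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer


def predict_in_batches(model, vectorizer, texts, batch_size=1000):
    """Vectorize and predict one slice at a time, then concatenate.

    Hypothetical helper: only one batch of features is held in memory
    at any moment, instead of the full test matrix.
    """
    parts = []
    for start in range(0, len(texts), batch_size):
        X_batch = vectorizer.transform(texts[start:start + batch_size])
        parts.append(model.predict(X_batch))
    return np.concatenate(parts)


# Toy data standing in for the real text column (assumption).
train_texts = ["cheap pills online", "meeting at noon",
               "win money now", "lunch tomorrow?"] * 5
train_labels = np.array([1, 0, 1, 0] * 5)
test_texts = ["win cheap pills", "noon meeting moved",
              "free money online"] * 4

# Hashing fixes the number of columns up front and keeps no
# vocabulary dictionary in memory; it is stateless, so no fit is needed.
vectorizer = HashingVectorizer(n_features=2**12)

# max_depth bounds tree size; n_jobs=1 avoids per-worker memory copies.
model = RandomForestClassifier(n_estimators=10, max_depth=10,
                               n_jobs=1, random_state=0)
model.fit(vectorizer.transform(train_texts), train_labels)

batched = predict_in_batches(model, vectorizer, test_texts, batch_size=5)
full = model.predict(vectorizer.transform(test_texts))
```

Comparing batched with full (for example with np.array_equal) is a quick way to check on a small sample that batching does not change the predictions. The same helper should work unchanged with a linear model such as LogisticRegression in place of the random forest.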
_______________________________________________ scikit-learn mailing list [email protected] https://mail.python.org/mailman/listinfo/scikit-learn
