Hi David,

On 9 January 2013 02:14, David Broyles wrote:
> Hi,
>
> I'm pretty new to scikit-learn. I've generated a random forest
> (classification) of 100 trees using default attributes. My data set has
> over 2M examples.
>
> 2 questions:
>
> 1) I've noticed the size of the pickled model is quite large (e.g. ~9GB). A
> comparable model trained with R's randomForest package is only about 40 MB
> (and randomForest defaults for tree complexity seem similar to scikit's). I
> don't believe randomForest is pruning the tree, but I could be wrong. Any
> ideas what may be causing this large a difference?

We are not pruning the trees by default. They are all grown until nodes
contain fewer than min_samples_split=2 samples. Note, however, that if
you are doing regression, randomForest uses nodesize=5 by default, which
leads to smaller trees, whereas we use min_samples_split=2 by default.

Try using cPickle with HIGHEST_PROTOCOL, or joblib.dump as Gael
recommends. You may also use a gzip file handle in combination with
either of these to further reduce the size of your file.

> 2) Let's say I want each tree in the forest to be built off of a 200k sample
> from the 2M examples. Does leaving the min_density at 0.1 achieve this, or
> am I misunderstanding the role of this hyperparameter?

min_density does not do that. It controls a trade-off between computing
time and memory requirements that is specific to our algorithm, but
whatever its value, all samples in X will be used to build the trees.
The closest you can get to that is to subsample X prior to `fit` and
then build all trees on that same subsample.

Hope this helps,

Gilles

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
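
P.S. The two suggestions in the reply (a compressed pickle, and
subsampling before `fit`) could be sketched roughly as follows. This is
only an illustration under some assumptions: it targets Python 3, where
the standard pickle module takes the place of cPickle, and the helper
names save_compressed, load_compressed and subsample_indices are made up
for this example and are not part of scikit-learn.

```python
import gzip
import pickle
import random


def save_compressed(model, path):
    """Pickle `model` with the highest protocol into a gzip-compressed file."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(model, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Load an object previously written by `save_compressed`."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


def subsample_indices(n_samples, n_keep, seed=0):
    """Return `n_keep` distinct row indices, drawn without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_samples), n_keep))


# Hypothetical usage with NumPy arrays X, y and scikit-learn (not run here):
#   idx = subsample_indices(X.shape[0], 200000)
#   clf = RandomForestClassifier(n_estimators=100).fit(X[idx], y[idx])
#   save_compressed(clf, "forest.pkl.gz")
```

Note that every tree is then built from the same 200k-row subsample,
which is not the same thing as drawing a fresh 200k sample per tree.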
