Hi David,

On 9 January 2013 02:14, David Broyles <[email protected]> wrote:
> Hi,
>
> I'm pretty new to scikit-learn.  I've generated a random forest
> (classification) of 100 trees using default attributes.  My data set has
> over 2M examples.
>
> 2 questions:
>
> 1) I've noticed the size of the pickled model is quite large (e.g. ~9GB).  A
> comparable model trained with R's randomForest package is only about 40 MB
> (and randomForest defaults for tree complexity seem similar to scikit's).  I
> don't believe randomForest is pruning the tree, but I could be wrong.  Any
> ideas what may be causing this large a difference?

We do not prune the trees by default. Each tree is grown until its
nodes contain fewer than min_samples_split=2 samples. Note, however,
that if you are doing regression, randomForest uses nodesize=5 by
default, which leads to smaller trees, while we use
min_samples_split=2 by default.

Try using cPickle with HIGHEST_PROTOCOL or joblib.dump as Gael
recommends. You may also use a gzip file handler in combination with
any of these to further reduce the size of your file.
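
As a rough sketch of the pickle-plus-gzip suggestion above: on Python 3,
cPickle has been merged into the standard `pickle` module, so that is
what is used here. The helper names `dump_compressed` and
`load_compressed` and the file name are made up for illustration, and
the dictionary is only a stand-in for a fitted forest.

```python
import gzip
import os
import pickle
import tempfile


def dump_compressed(obj, path):
    """Pickle obj with the highest protocol into a gzip-compressed file."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Load an object written by dump_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted forest; any picklable object works the same way.
model = {"n_estimators": 100, "thresholds": [0.5] * 10000}

path = os.path.join(tempfile.mkdtemp(), "forest.pkl.gz")
dump_compressed(model, path)
restored = load_compressed(path)

# Alternative (assumes joblib is installed):
# joblib.dump(model, "forest.joblib", compress=3)
```

How much space this saves depends on the model, so it is worth
comparing the file sizes on your own forest.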

> 2) Let's say I want each tree in the forest to be built off of a 200k sample
> from the 2M examples.  Does leaving the min_density at 0.1 achieve this, or
> am I misunderstanding the role of this hyperparameter?

min_density does not do that. It controls a trade-off between
computation time and memory requirements that is specific to our
algorithm, but whatever its value, all samples in X are used to
build the trees.

The closest you can get to that is to subsample X prior to `fit` and
then build all trees on that same subsample.
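
A minimal sketch of that subsampling step, assuming X and y are NumPy
arrays. The `subsample` helper and the toy data are hypothetical, and
the sizes are scaled down from the 2M/200k in the question.

```python
import numpy as np


def subsample(X, y, n_subsample, seed=0):
    """Draw n_subsample rows of (X, y) without replacement."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(X.shape[0], size=n_subsample, replace=False)
    return X[idx], y[idx]


# Toy stand-in for the full data set.
X = np.random.RandomState(42).rand(2000, 5)
y = (X[:, 0] > 0.5).astype(int)

X_sub, y_sub = subsample(X, y, n_subsample=200)

# Every tree is then built from this one subsample, e.g.:
# from sklearn.ensemble import RandomForestClassifier
# forest = RandomForestClassifier(n_estimators=100).fit(X_sub, y_sub)
```

Note that this is not the same as drawing a fresh 200k sample per tree:
all trees see the same rows.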

Hope this helps,

Gilles

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
