2012/1/10 Andreas <[email protected]>:
> Next question about DecisionTrees:
> I am not sure if I understand the documentation correctly. It says:
> "Setting min_density to 0 will always use the sample mask to select the
> subset of samples at each node.
> This results in little to no additional memory being allocated, making it
> appropriate for massive datasets or within ensemble learners,
> but at the expense of being slower when training deep trees. "
>
> This sounds to me as if "min_density=0" is slowest but takes the least
> memory. Is that what is meant?

Correct, but if you grow your trees deep then the runtime overhead
should be significant.
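
If you want to verify this on your own data, a quick timing sketch
along the following lines should do. Treat the exact signature as an
assumption - `min_density` is a constructor argument of the tree
classes in the current release, but that may differ in your version:

    # rough timing sketch; assumes a scikit-learn version whose
    # DecisionTreeClassifier exposes the `min_density` parameter
    import time
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(6180, 2000)             # shape from your first benchmark
    y = rng.randint(0, 10, size=6180)

    for md in (0.0, 0.1, 0.5, 1.0):
        t0 = time.time()
        DecisionTreeClassifier(min_density=md).fit(X, y)
        print("min_density=%.1f: %.2fs" % (md, time.time() - t0))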

>
> When doing benchmarking, I found "min_density=0" to be the fastest version
> on my dataset.
> It has set n_samples = 6180, n_features = 2000, n_class=10,
>
> Then I tried MNIST (n_samples=60000, n_features=784, n_class=10) and
> found min_density=0 to be slower than .1 (twice as long), and .5 to be
> slower than .1 as well.

that sounds reasonable - 0.5 triggers a fancy indexing op (=copy)
whenever more than 50% of the samples are out of the (current) mask,
which means that basically whenever you make a split either the left
or the right child will be fancy indexed. Fancy indexing itself is
costly and must be amortized by less time spent traversing the sample
mask.
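
To make the difference concrete, here is a small numpy-only sketch of
the two strategies (illustrative, not the actual tree code):

    import numpy as np

    X = np.random.rand(60000, 784)
    in_node = np.random.rand(60000) < 0.4  # 40% of samples reach this node

    # min_density=0 strategy: keep the full array and carry the boolean
    # mask around; nothing is copied, but every scan walks all 60000 rows
    sample_mask = in_node

    # min_density=0.5 strategy: once fewer than 50% of the samples are
    # left, fancy indexing materializes a dense copy so that later scans
    # only touch the remaining rows
    X_node = X[in_node]                    # allocates a new array (copy)
    print(X_node.shape)                    # roughly (24000, 784)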

>
>
> On digits, since training is very fast, it was hard to measure any real
> difference.
> Still, min_density=0 was fastest and min_density=1 was slowest (1.5 times as
> slow).
>
> I use the default settings from RandomForest, which has max_depth=None,
> max_features=auto,
> and I am using only one tree (n_estimators=1).
> On which data sets did the statement in the documentation hold?

we chose the default parameter (min_density=0.1) based on covertype
(large number of samples, few features), madelon, (and arcene) - none
of these has a significant number of features.

>
> It seems to me that there is some sweet spot for each dataset and that on
> the datasets
> I tested, low values seem faster. Setting min_density=1 was often very slow
>
> What are your experiences?

For shallow trees (in particular base learners for boosting) I use
min_density=0 too.
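
E.g. something along these lines - whether the ensemble constructors
forward `min_density` to their trees depends on the version you are
running, so treat the parameter as an assumption:

    from sklearn.ensemble import RandomForestClassifier

    # shallow base learners + min_density=0: no per-node copies are
    # made, avoiding the allocation overhead across many small trees
    clf = RandomForestClassifier(n_estimators=100, max_depth=3,
                                 min_density=0.0)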

>
> While .1 seems a good default value, it doesn't seem to be a tradeoff
> between time and memory on the datasets I tested. Rather it seems to be
> the value that makes the algorithm run fastest.

Agreed - the memory tradeoff is negligible since we basically build
the tree in a depth-first fashion, and the deeper you get the smaller
the arrays become, thus the less memory they consume. The time it
takes to traverse the sample_mask is the major limiting factor.
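
A back-of-the-envelope calculation illustrates this - idealized by
assuming each split sends half of the samples to the child we descend
into:

    # depth-first construction keeps at most one copy per level of the
    # current root-to-leaf path alive; with halving node sizes the live
    # copies sum to ~2x the root size, no matter how deep the tree gets
    n, live = 60000, []
    while n > 1:
        live.append(n)
        n //= 2
    print("rows alive on one path: %d (~2 x 60000)" % sum(live))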

Thanks for your analysis, that was really useful - we should modify the
docstrings to make this clearer.

best,
 Peter

>
> Any help / comment / remarks very welcome!
>
> Thanks,
> Andy



-- 
Peter Prettenhofer
