2012/1/10 Andreas <[email protected]>:
> Next question about DecisionTrees:
> I am not sure if I understand the documentation correctly. It says:
> "Setting min_density to 0 will always use the sample mask to select the
> subset of samples at each node.
> This results in little to no additional memory being allocated, making it
> appropriate for massive datasets or within ensemble learners,
> but at the expense of being slower when training deep trees."
>
> This sounds to me as if "min_density=0" is slowest but takes the least
> memory. Is that what is meant?
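
(For reference: the rule the docstring describes can be sketched roughly as
follows. This is a minimal NumPy illustration, not scikit-learn's actual
Cython implementation, and the helper name apply_min_density is
hypothetical.)

    import numpy as np

    def apply_min_density(X, y, sample_mask, min_density):
        # `sample_mask` is a boolean array marking the samples still
        # active at the current node.
        density = sample_mask.mean()  # fraction of active samples
        if density < min_density:
            # Fancy indexing materializes the active rows as a copy:
            # extra memory and a one-off cost, but later passes no
            # longer have to skip over masked-out rows.
            X, y = X[sample_mask], y[sample_mask]
            sample_mask = np.ones(X.shape[0], dtype=bool)
        # With min_density=0 the copy never happens (low memory, but
        # every pass traverses the full-length mask); with min_density=1
        # a copy happens at essentially every split.
        return X, y, sample_mask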
Correct, but if you grow your trees deep then the runtime overhead should
be significant.

> When doing benchmarking, I found "min_density=0" to be the fastest
> version on my dataset.
> It has n_samples=6180, n_features=2000, n_classes=10.
>
> Then I tried with MNIST (n_samples=60000, n_features=784, n_classes=10)
> and found min_density=0 to be slower than .1 (twice as long), but also
> .5 to be slower than .1

That sounds reasonable - 0.5 triggers a fancy indexing op (= copy)
whenever more than 50% of the samples are out of the (current) mask,
which means that basically whenever you make a split, either the left or
the right child will be fancy indexed. Fancy indexing itself is costly
and must be amortized by less time spent traversing the sample mask.

> On digits, since training is very fast, it was hard to measure any real
> difference.
> Still, min_density=0 was fastest and min_density=1 was slowest (1.5
> times as slow).
>
> I use the default settings from RandomForest, which has max_depth=None
> and n_features=auto, and I am using only one tree (n_estimators=1).
> On which datasets did the statement in the documentation hold?

We chose the default parameter (min_density=0.1) on covertype (large
number of samples, few features), Madelon (and Arcene) - none of these
has a significant number of features.

> It seems to me that there is some sweet spot for each dataset and that
> on the datasets I tested, low values seem faster. Setting min_density=1
> was often very slow.
>
> What are your experiences?

For shallow trees (in particular base learners for boosting) I use
min_density=0 too.

> While .1 seems a good default value, it doesn't seem to be a tradeoff
> between time and memory on the datasets I tested. Rather, it seems to
> be the value that makes the algorithm run fastest.

Agreed - the memory tradeoff is negligible, since we basically build the
tree in a depth-first fashion and the deeper you get, the smaller the
arrays become, and thus the less memory they consume. The time it takes
to traverse the sample_mask is the major limiting factor.

Thanks for your analysis, that was really useful - we should modify the
docstrings to make this clearer.

best,
 Peter

> Any help / comment / remarks very welcome!
>
> Thanks,
> Andy

--
Peter Prettenhofer
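
(A rough sketch of the kind of timing comparison discussed above. It
assumes a scikit-learn release from this era, around 0.10/0.11, in which
DecisionTreeClassifier still accepted a min_density argument; the
parameter was later deprecated and removed, so this will not run on
modern versions. As noted in the thread, digits trains too fast for the
differences to show clearly; a larger dataset such as MNIST makes the gap
between settings visible.)

    import time
    from sklearn.datasets import load_digits
    from sklearn.tree import DecisionTreeClassifier

    digits = load_digits()
    X, y = digits.data, digits.target

    # Time a single fully-grown tree for each min_density setting.
    for min_density in (0.0, 0.1, 0.5, 1.0):
        clf = DecisionTreeClassifier(min_density=min_density,
                                     max_depth=None, random_state=0)
        tic = time.time()
        clf.fit(X, y)
        print("min_density=%.1f: fit took %.3fs"
              % (min_density, time.time() - tic))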
