Thanks for your comments, Peter and Brian.

On 01/10/2012 02:24 PM, Peter Prettenhofer wrote:
> 2012/1/10 Andreas<[email protected]>:
>    
>> Next question about DecisionTrees:
>> I am not sure if I understand the documentation correctly. It says:
>> "Setting min_density to 0 will always use the sample mask to select the
>> subset of samples at each node.
>> This results in little to no additional memory being allocated, making it
>> appropriate for massive datasets or within ensemble learners,
>> but at the expense of being slower when training deep trees. "
>>
>> This sounds to me as if "min_density=0" is slowest but takes the least
>> memory. Is that what is meant?
>>      
> Correct, but if you grow your trees deep then the runtime overhead
> should be significant.
>
>    
I have not explicitly looked at the depth of the trees, but by
default max_depth=None, meaning the trees are grown to full depth.
I haven't tried with max_features=num_features, though.
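
For concreteness, this is the knob in question, as a minimal sketch
against the constructor API of the version I am using (the exact
parameter set may well differ between releases):

    from sklearn.tree import DecisionTreeClassifier

    # min_density=0.0: always work through the sample mask; little to
    # no extra memory, but potentially slower when growing deep trees.
    clf_mask = DecisionTreeClassifier(max_depth=None, min_density=0.0)

    # min_density=0.1: once fewer than 10% of the entries in a node's
    # mask are active, pack that node's samples into a fresh array via
    # fancy indexing (a copy) and continue without the mask.
    clf_copy = DecisionTreeClassifier(max_depth=None, min_density=0.1)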

>> When doing benchmarking, I found "min_density=0" to be the fastest
>> version on my dataset.
>> It has n_samples = 6180, n_features = 2000, n_classes = 10.
>>
>> Then I tried with MNIST (n_samples=60000, n_features=784,
>> n_classes=10) and found min_density=0 to be slower than 0.1 (about
>> twice as long), but 0.5 was also slower than 0.1.
>>      
> that sounds reasonable - 0.5 triggers a fancy indexing op (=copy)
> whenever more than 50% of the samples are out of the (current) mask.
> Which means that basically whenever you make a split, either the left
> or the right child will be fancy indexed. Fancy indexing itself is
> costly and must be amortized by less time spent traversing the sample
> mask.
>
>    
Yeah, it does seem reasonable. Still, I find it hard to judge in
advance when the fancy indexing cost is amortized.
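
To make the trade-off concrete, here is a toy NumPy sketch of the two
strategies (illustrative only, not the actual tree code; the sizes are
made up to mimic MNIST):

    import numpy as np

    X = np.random.rand(60000, 784)
    sample_mask = np.zeros(60000, dtype=bool)
    sample_mask[np.random.permutation(60000)[:3000]] = True  # 5% dense

    # Strategy A (min_density=0): keep the full array and skip the
    # masked-out rows.  No copy, but every pass over this node still
    # scans all 60000 mask entries although only 3000 are active.
    total = sum(X[i, 0] for i in range(X.shape[0]) if sample_mask[i])

    # Strategy B (density below min_density): pack the active samples
    # once via fancy indexing (this is the costly copy); after that,
    # every later pass over the node touches only the 3000 packed rows.
    X_packed = X[sample_mask]
    total_packed = X_packed[:, 0].sum()

Whether the copy in B pays off depends on how many more passes the
subtree below that node makes over the samples, which is exactly the
part that seems hard to predict in advance.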

Letting the user make the choice seems like a good decision,
though from the documentation alone it was not clear what the
different values of the parameter actually meant.
After reading through the Cython code and doing some profiling
I got there ;)

> Thanks for your analysis that was really useful - we should modify the
> docstrings to make this more clear.
>    
That would be great. Maybe some form of Brian's explanation
could be included.


A related question:
I know you spent a lot of time tweaking the decision tree code.
Do you think there is still some potential for speedups there?

In particular, would it be possible to use lists of pointers/indices
instead of the masks, to avoid iterating over masked-out items? Or do
you think that would create too much overhead?
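
A rough sketch of what I mean, just to make the idea concrete (the
partition helper below is hypothetical, nothing from the current
code): keep one array of sample indices and reorder it in place at
each split, so every node works on a contiguous slice, with no mask to
scan and no data copied:

    import numpy as np

    def partition(samples, X, feature, threshold):
        # Hypothetical helper: reorder `samples` in place so that the
        # indices whose feature value is <= threshold come first, and
        # return the split point.
        lo, hi = 0, len(samples) - 1
        while lo <= hi:
            if X[samples[lo], feature] <= threshold:
                lo += 1
            else:
                samples[lo], samples[hi] = samples[hi], samples[lo]
                hi -= 1
        return lo

    X = np.random.rand(10, 3)
    samples = np.arange(10)
    split = partition(samples, X, feature=0, threshold=0.5)
    left, right = samples[:split], samples[split:]  # contiguous slices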

The current code works great for me (thanks for contributing!!!!),
but it would mean a lot if I could make it even faster. At the moment
it takes about 8 hours to grow a single tree with only a subset of
the features that I actually want to use.... I have a 128-core
cluster here, but building a forest with 1000 trees would still take
roughly 6 days....
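
In the meantime, since the trees of a forest are independent, I can at
least fan the work out over the cores of one machine with joblib; a
rough sketch (the bootstrap helper is mine, illustrative only):

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.tree import DecisionTreeClassifier

    def fit_one_tree(X, y, seed):
        # Fit one tree on a bootstrap sample; every call is independent.
        rng = np.random.RandomState(seed)
        idx = rng.randint(0, X.shape[0], X.shape[0])
        return DecisionTreeClassifier().fit(X[idx], y[idx])

    X = np.random.rand(500, 20)
    y = np.random.randint(0, 2, 500)

    # n_jobs is the number of cores on one machine; across the cluster
    # I would shard the seeds over machines in the same way.
    trees = Parallel(n_jobs=4)(delayed(fit_one_tree)(X, y, s)
                               for s in range(16))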

Thanks,
Andy
