[ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740931#action_12740931
 ] 

Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------

bq. have demonstrated that partitioning works to produce a usable forest 
because your errors on the partitioned forest seem similar

Yep, the last test shows that the results of the partial implementation are 
correct... or that the reference implementation is wrong, but I'm not 
considering that possibility (just kidding: my first tests on the ref. impl. 
gave results similar to those in Breiman's paper).

bq. have demonstrated substantial speedup for large numbers of trees

Oh yeah, it is fast: the partial implementation running on my laptop is twice 
as fast as the in-mem implementation running on a 10-node cluster!
But its oob error is not as good. I should use a larger dataset (why not KDD 
100%) with more trees and see what happens.

Actually, there is a performance issue that I ran into using KDD 25% (hmm... 
bigger datasets seem to bring bigger problems). It should take a day or two to 
resolve.

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_2.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."
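
To make the quoted suggestion concrete, here is a minimal Python sketch of the
idea: split the data into partitions, grow a few trees on each partition
alone, and let all trees vote together. It is not the code from the attached
patch. The names train_stump, build_partial_forest and predict are
hypothetical, and one-split stumps stand in for full decision trees only to
keep the example short.

```python
import random
from collections import Counter


def train_stump(rows, rng):
    """Fit a one-split tree (a stump) on a bootstrap sample of rows.

    rows is a list of (features, label) pairs. A stump is an assumption made
    for brevity; a real forest would grow full trees here.
    """
    sample = [rng.choice(rows) for _ in rows]    # bootstrap: draw with replacement
    feature = rng.randrange(len(sample[0][0]))   # one random feature to split on
    best = None
    for candidate, _ in sample:
        threshold = candidate[feature]
        left = [y for x, y in sample if x[feature] <= threshold]
        right = [y for x, y in sample if x[feature] > threshold]
        # left is never empty: the candidate row itself falls on the left side.
        left_label, left_hits = Counter(left).most_common(1)[0]
        if right:
            right_label, right_hits = Counter(right).most_common(1)[0]
        else:
            right_label, right_hits = left_label, 0
        errors = len(sample) - left_hits - right_hits
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label


def build_partial_forest(rows, num_partitions, trees_per_partition, seed=0):
    """Grow trees_per_partition trees on each partition, seeing only that partition."""
    rng = random.Random(seed)
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    return [train_stump(part, rng)
            for part in partitions
            for _ in range(trees_per_partition)]


def predict(forest, x):
    """Classify x by majority vote over every tree in the forest."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]
```

Each tree only ever sees its own partition, which is where the "loses some of
the solidity" caveat comes from: the bootstrap samples are drawn from a slice
of the data rather than from the whole dataset.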

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
