[
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740931#action_12740931
]
Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------
bq. have demonstrated that partitioning works to produce a usable forest
because your errors on the partitioned forest seem similar
Yep, the last test shows that the results of the partial implementations are
correct... or that the reference implementation is wrong, but I'm not
considering this possibility (just kidding: my first tests on the ref. impl.
gave results similar to Breiman's paper).
bq. have demonstrated substantial speedup for large numbers of trees
Oh yeah, it's fast: the partial implementation running on my laptop is two times
faster than the in-mem implementation running on a 10-node cluster!
But its out-of-bag (oob) error is not so good. I should use a larger dataset
(why not KDD 100%) with more trees and see what happens.
Actually, there is a performance issue that I hit when using KDD 25% (hum...
using bigger datasets seems to bring bigger problems). It should take a day or
two to resolve.
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Deneche A. Hakim
> Priority: Minor
> Attachments: partial_August_2.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."
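
The partial-data idea quoted above can be sketched roughly as follows. This is
only an illustration, not the code in the attached patch: every function name
is hypothetical, a one-level decision stump stands in for a full tree to keep
the example short, and a plain loop plays the role of the mappers.

```python
import random
from collections import Counter


def train_stump(rows, rng):
    """Train a one-level tree (a stump) on a bootstrap sample of `rows`.

    `rows` is a list of (features, label) pairs. The stump is a simplified
    stand-in for a full decision tree (assumption made for brevity).
    """
    sample = [rng.choice(rows) for _ in rows]    # bootstrap: draw with replacement
    feature = rng.randrange(len(sample[0][0]))   # pick one random feature
    threshold = rng.choice(sample)[0][feature]   # split value taken from the sample
    left = Counter(lbl for x, lbl in sample if x[feature] <= threshold)
    right = Counter(lbl for x, lbl in sample if x[feature] > threshold)
    majority = Counter(lbl for _, lbl in sample).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else majority
    right_label = right.most_common(1)[0][0] if right else majority
    return feature, threshold, left_label, right_label


def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def build_partial_forest(rows, num_partitions, trees_per_partition, seed=0):
    """Build trees per data partition; each tree sees only its own partition."""
    rng = random.Random(seed)
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    forest = []
    for part in partitions:                      # one iteration ~ one mapper
        for _ in range(trees_per_partition):
            forest.append(train_stump(part, rng))
    return forest


def predict_forest(forest, x):
    """Combine all partial trees by majority vote."""
    votes = Counter(predict_stump(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```

Because no tree ever reads data outside its partition, each partition can be
handled by a separate worker holding only that slice in memory; the trade-off,
as the quote says, is that each tree is trained on less data than in the
original method.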