[jira] Commented: (MAHOUT-145) PartialData mapreduce Random Forests

Deneche A. Hakim (JIRA) Thu, 10 Sep 2009 04:21:26 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753572#action_12753572
 ]


Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------

bq. What about using the Yahoo 0.20 distribution?  
(http://developer.yahoo.com/hadoop/distribution/ )

The Yahoo distribution did the job!

I ran the tests on a 10-node cluster with KDD10. Apart from a difference in 
execution time (the 0.20.0 implementation uses one more step), the results are 
the same.

bq. For now I'm not able to process KDD 100% because of a limitation in my code. 
The Partial Builder takes 6 minutes to build 100 with 10 maps, but the example 
program hangs when comparing the forest predictions with the data labels, 
because the current example code loads the whole dataset in memory before 
checking the labels =P

* TODO: avoid loading the whole dataset in memory just to extract the labels; 
this should help when dealing with large datasets
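
A minimal sketch of what this TODO could look like, not the actual Mahout code: 
the records are read one line at a time, the label is taken from each record, 
and only two counters are kept in memory. The class and method names 
(StreamingLabelCheck, labelOf, countCorrect) are hypothetical, the label is 
assumed to be the last comma-separated field, and the classifier function is 
only a stand-in for the forest.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Function;

// Hypothetical helper, not part of Mahout: checks predictions against labels
// while streaming the data, so the dataset is never held in memory as a whole.
class StreamingLabelCheck {

    // Assumption: the label is the last comma-separated field of a record.
    static String labelOf(String line) {
        return line.substring(line.lastIndexOf(',') + 1).trim();
    }

    // Reads the records once and returns {correct, total}.
    // 'classifier' stands in for the forest: it maps a raw record to a label.
    static long[] countCorrect(Reader data, Function<String, String> classifier) {
        long correct = 0;
        long total = 0;
        try (BufferedReader in = new BufferedReader(data)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                total++;
                if (labelOf(line).equals(classifier.apply(line))) {
                    correct++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {correct, total};
    }

    public static void main(String[] args) {
        // Tiny in-memory sample; a real run would pass a reader over the data file.
        String sample = "0,tcp,normal\n0,udp,attack\n1,tcp,normal\n";
        long[] result = countCorrect(new StringReader(sample), line -> "normal");
        System.out.println(result[0] + " of " + result[1] + " labels matched");
    }
}
```

Memory use stays constant regardless of the dataset size, since each record 
is discarded as soon as its label has been compared.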

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_10.patch, partial_August_13.patch, 
> partial_August_15.patch, partial_August_17.patch, partial_August_19.patch, 
> partial_August_2.patch, partial_August_24.patch, partial_August_27.patch, 
> partial_August_31.patch, partial_August_9.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
