[jira] Commented: (MAHOUT-145) PartialData mapreduce Random Forests

Deneche A. Hakim (JIRA) Tue, 11 Aug 2009 01:39:40 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741759#action_12741759
 ]


Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------

bq. These are confusing numbers. First, why does the number of trees vary like 
this?

Hmm... well... my primary focus was to check whether the implementation was 
able to handle larger datasets. I shall run another, more coherent, batch of 
tests soon.

bq. Secondly, the oob error jumps around a lot in confusing ways.

bq. Thirdly, the times don't seem to match what I would expect. Moreover, KDD10 
at 10 and 50 map tasks take exactly the same amount of time.

Ouch! It's a copy-and-paste brain-bug! OK, I'll be more careful with the next 
test.
 
bq. My expectation would have been that running 20 map tasks would do almost 
twice as well as running 10 because we have 10 machines each of which is dual 
core. Running 50 map tasks should be about the same as 20. We see that pattern 
on KDD25 except we don't have a datapoint for 50 maps.

Re-Ouch, I used the same configuration that I used with the In-Mem 
implementation: *mapred.tasktracker.map.tasks.maximum=1*, so only one mapper 
runs at a time on each node.
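
To make the effect of that setting concrete, here is a rough back-of-the-envelope 
sketch, not a measurement. It assumes the 10 nodes mentioned in the quoted 
comment, that the data is split evenly across map tasks, and that a task's 
running time is proportional to the amount of data it receives. The function 
names are hypothetical and are not part of Mahout or Hadoop.

```python
import math


def map_waves(num_map_tasks, num_nodes, max_maps_per_node):
    """Number of sequential 'waves' needed to run all map tasks.

    Each node runs at most max_maps_per_node mappers at once
    (the role played by mapred.tasktracker.map.tasks.maximum).
    """
    slots = num_nodes * max_maps_per_node
    return math.ceil(num_map_tasks / slots)


def relative_time(num_map_tasks, num_nodes, max_maps_per_node):
    """Crude relative map-phase time.

    Assumption: every task handles 1/num_map_tasks of the data and its
    cost is linear in that share, so time = waves * (1 / num_map_tasks).
    Scheduling overhead and the reduce phase are ignored.
    """
    waves = map_waves(num_map_tasks, num_nodes, max_maps_per_node)
    return waves / num_map_tasks


if __name__ == "__main__":
    for max_maps in (1, 2):
        for tasks in (10, 20, 50):
            print(max_maps, tasks,
                  map_waves(tasks, 10, max_maps),
                  relative_time(tasks, 10, max_maps))
```

Under these assumptions, with one mapper per node the estimate is the same for 
10, 20 and 50 map tasks, whereas with two mappers per node 20 tasks come out at 
about half the time of 10, and 50 tasks land close to 20.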

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_10.patch, partial_August_2.patch, 
> partial_August_9.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
