[
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741759#action_12741759
]
Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------
bq. These are confusing numbers. First, why does the number of trees vary like
this?
Hmm... well... my primary focus was to check whether the implementation could
handle larger datasets. I will run another, more coherent batch of tests soon.
bq. Secondly, the oob error jumps around a lot in confusing ways.
bq. Thirdly, the times don't seem to match what I would expect. Moreover, KDD10
at 10 and 50 map tasks take exactly the same amount of time.
Ouch! That's a copy-and-paste brain-bug! OK, I'll be more careful with the next
test.
bq. My expectation would have been that running 20 map tasks would do almost
twice as well as running 10 because we have 10 machines each of which is dual
core. Running 50 map tasks should be about the same as 20. We see that pattern
on KDD25 except we don't have a datapoint for 50 maps.
Re-ouch! I used the same configuration that I used with the In-Mem
implementation: *mapred.tasktracker.map.tasks.maximum=1*, i.e. only one mapper
runs at a time on each node.
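
To make the effect of that setting concrete, here is a rough back-of-the-envelope sketch in Java. It is not Mahout or Hadoop code: the class and method names are hypothetical, and it assumes that the data is split evenly across map tasks, that a task's run time is proportional to its share of the data, and that scheduling overhead is negligible. Under those assumptions the cluster has (nodes x maximum maps per node) map slots, and the tasks run in successive "waves" of that size.

```java
// Hypothetical sketch; none of these names come from Mahout or Hadoop.
final class MapSlotSketch {

    /** Cluster-wide map slots: nodes * mapred.tasktracker.map.tasks.maximum. */
    static int slots(int nodes, int maxMapsPerNode) {
        return nodes * maxMapsPerNode;
    }

    /** Number of sequential waves needed to run all map tasks. */
    static int waves(int mapTasks, int nodes, int maxMapsPerNode) {
        int s = slots(nodes, maxMapsPerNode);
        return (mapTasks + s - 1) / s; // ceiling division
    }

    /**
     * Relative map-phase time, ASSUMING each task handles 1/mapTasks of the
     * data and takes time proportional to that share (a simplification).
     */
    static double relativeTime(int mapTasks, int nodes, int maxMapsPerNode) {
        return waves(mapTasks, nodes, maxMapsPerNode) / (double) mapTasks;
    }

    public static void main(String[] args) {
        int nodes = 10; // cluster size mentioned in the thread
        for (int maxPerNode : new int[] {1, 2}) {
            for (int tasks : new int[] {10, 20, 50}) {
                System.out.printf("max=%d maps=%d waves=%d relTime=%.3f%n",
                        maxPerNode, tasks,
                        waves(tasks, nodes, maxPerNode),
                        relativeTime(tasks, nodes, maxPerNode));
            }
        }
    }
}
```

In this simplified model, one slot per node gives the same relative time for 10, 20 and 50 map tasks, while two slots per node roughly halves the time when going from 10 to 20 map tasks and gains little more at 50. Real timings will differ, since tree building is not strictly linear in the partition size.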
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Deneche A. Hakim
> Priority: Minor
> Attachments: partial_August_10.patch, partial_August_2.patch,
> partial_August_9.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."