[jira] Commented: (MAHOUT-145) PartialData mapreduce Random Forests

Ted Dunning (JIRA) Wed, 05 Aug 2009 14:50:40 -0700

    [ https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739775#action_12739775 ]


Ted Dunning commented on MAHOUT-145:
------------------------------------

Ouch!

|| Num Map Tasks || Num trees || In-Mem build time || Partial build time || In-Mem oob error || Partial oob error ||
| ...|
| 2 | 100 | 0h 0m 57s 641 | 0h 0m 44s 43 | 4.45E-4 | 0.42 |
| ... |
| 10 | 400 | 0h 3m 33s 253 | 0h 1m 8s 29 | 4.45E-4 | 0.23 |

The partial build looks like it runs faster (or at least not much slower), but 
it produces astronomically worse OOB error.  

What really bugs me is that it is worse with fewer maps.  Am I interpreting 
this correctly when I say that splitting the data in half and building 
independent forests increases the OOB error by a factor of about 1000 
(4.45E-4 to 0.42)?  How could that possibly be?
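
As a quick sanity check on the "factor of 1000" reading, here is a minimal Python sketch that computes the ratio of partial to in-memory OOB error. It uses only the two rows visible in the table; the elided rows are not filled in, and the names `rows` and `oob_ratio` are made up for illustration rather than taken from the patch.

```python
# OOB error rates copied from the two visible table rows.
# The "..." rows are elided in the comment and are deliberately left out.
rows = [
    {"maps": 2, "trees": 100, "in_mem_oob": 4.45e-4, "partial_oob": 0.42},
    {"maps": 10, "trees": 400, "in_mem_oob": 4.45e-4, "partial_oob": 0.23},
]


def oob_ratio(row):
    """How many times larger the partial OOB error is than the in-memory one."""
    return row["partial_oob"] / row["in_mem_oob"]


ratios = [oob_ratio(r) for r in rows]

for row, ratio in zip(rows, ratios):
    print(f"{row['maps']} maps, {row['trees']} trees: "
          f"partial/in-mem OOB ratio ~ {ratio:.0f}")
```

With these numbers the ratio is roughly 940 for 2 maps and roughly 520 for 10 maps, so "a factor of 1000" is a fair round figure for the 2-map case, and the gap does narrow as the number of maps grows.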



> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_2.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
