[jira] Commented: (MAHOUT-145) PartialData mapreduce Random Forests

Deneche A. Hakim (JIRA) Mon, 10 Aug 2009 11:31:38 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741470#action_12741470
 ]


Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------

Here are some results from a 10-node cluster (c1.medium):

|| Dataset || Num Map Tasks || Num Trees || Build Time (h m s ms) || OOB Error ||
| KDD 10% | 10 | 400 | 0h 1m 46s 19 | 0.051 |
| KDD 10% | 20 | 400 | 0h 1m 15s 571 | 0.090 |
| KDD 10% | 50 | 400 | 0h 1m 46s 19 | 0.051 |
| KDD 25% | 10 | 100 | 0h 1m 18s 574 | 0.43 |
| KDD 25% | 10 | 400 | 0h 4m 9s 999 | 0.019 |
| KDD 25% | 20 | 400 | 0h 2m 42s 293 | 0.50 |

Because I was running into heap size issues, I set HADOOP_HEAPSIZE=2000 for the next tests:

|| Dataset || Num Map Tasks || Num Trees || Build Time (h m s ms) || OOB Error ||
| KDD 50% | 10 | 100 | 0h 1m 52s 338 | 0.19 |
| KDD 50% | 20 | 400 | 0h 5m 54s 961 | 0.18 |
| KDD 50% | 50 | 400 | 0h 4m 18s 861 | 0.47 |

For now I'm not able to process KDD 100% because of a limitation in my code. 
The Partial Builder takes 6 minutes to build 100 trees with 10 maps, but the 
example program hangs when comparing the forest predictions with the data 
labels, because the current example code loads the whole dataset into memory 
before checking the labels =P
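
One way around the memory limitation described above is to check the labels while streaming over the data instead of loading it first. The following is only a minimal sketch under assumptions, not code from the attached patches: the class StreamingErrorCheck, the method errorRate, and the toy classifier are hypothetical names, and the data is assumed to be CSV with the label in the last column.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.function.Function;

public class StreamingErrorCheck {

    /**
     * Reads CSV rows one at a time and returns the fraction of rows whose
     * predicted label differs from the label stored in the last column.
     * Only two counters are kept in memory, never the whole dataset.
     * (Hypothetical helper; the label position is an assumption.)
     */
    public static double errorRate(Reader source, Function<String[], String> classifier)
            throws IOException {
        long total = 0;
        long wrong = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] fields = line.split(",");
                String label = fields[fields.length - 1].trim();
                // Everything except the last column is passed to the classifier.
                String[] attributes = Arrays.copyOf(fields, fields.length - 1);
                if (!label.equals(classifier.apply(attributes))) {
                    wrong++;
                }
                total++;
            }
        }
        return total == 0 ? 0.0 : (double) wrong / total;
    }

    public static void main(String[] args) throws IOException {
        // Toy stand-in for a forest: predicts "attack" when the first attribute is "1".
        Function<String[], String> toyClassifier =
                attrs -> "1".equals(attrs[0].trim()) ? "attack" : "normal";
        String data = "1,tcp,attack\n0,udp,normal\n0,tcp,attack\n1,udp,attack\n";
        System.out.println(errorRate(new StringReader(data), toyClassifier));
    }
}
```

In a real run the Reader would wrap the dataset file and the classifier would delegate to the built forest; memory use then depends on the size of the forest rather than on the size of the dataset.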

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_10.patch, partial_August_2.patch, 
> partial_August_9.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."
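
To illustrate the quoted idea of building trees on different portions of the data, here is a small sketch of how a requested number of trees could be divided among data partitions (one partition per map task). It is an assumption for illustration only and not necessarily how the attached patches do it; TreesPerPartition and treesFor are hypothetical names.

```java
public class TreesPerPartition {

    /**
     * Number of trees that the given partition (0-based) would build when
     * numTrees are spread as evenly as possible over numPartitions.
     * Any remainder goes to the first partitions, one extra tree each.
     */
    public static int treesFor(int numTrees, int numPartitions, int partition) {
        if (numTrees < 0 || numPartitions <= 0
                || partition < 0 || partition >= numPartitions) {
            throw new IllegalArgumentException("invalid tree or partition count");
        }
        int base = numTrees / numPartitions;
        int remainder = numTrees % numPartitions;
        return base + (partition < remainder ? 1 : 0);
    }

    public static void main(String[] args) {
        // Example: 400 trees over 10 partitions, as in one of the runs above.
        for (int p = 0; p < 10; p++) {
            System.out.println("partition " + p + ": " + treesFor(400, 10, p) + " trees");
        }
    }
}
```

Each partition would then grow its share of trees using only its own rows, which is the trade-off the quote describes: each tree sees less data, but the per-task memory and build time go down.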

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
