[ https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739386#action_12739386 ]

Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------

I'm running some tests to compare the *in-mem* and *partial* 
implementations. Here are the first results from my laptop (Hadoop 0.19.1 in 
pseudo-distributed mode on a dual-core processor):

All tests use a random seed of 1, and only one random feature is selected 
at a time.

KDD 1%
|| Num Map Tasks || Num trees || In-Mem build time || Partial build time || In-Mem oob error || Partial oob error ||
| 2 | 10 | 0h 0m 21s 5 | 0h 0m 31s 823 | 8.38E-4 | 0.43 |
| 2 | 100 | 0h 0m 57s 641 | 0h 0m 44s 43 | 4.45E-4 | 0.42 |
| 2 | 200 | 0h 1m 38s 307 | 0h 1m 4s 523 | 4.45E-4 | 0.43 |
| 2 | 400 | 0h 3m 5s 883 | 0h 1m 43s 852 | 4.65E-4 | 0.42 |
| 5 | 10 | 0h 0m 28s 404 | 0h 0m 33s 374 | 8.38E-4 | 0.32 |
| 5 | 100 | 0h 1m 12s 260 | 0h 0m 43s 628 | 4.65E-4 | 0.34 |
| 5 | 200 | 0h 2m 0s 293 | 0h 0m 47s 994 | 4.45E-4 | 0.34 |
| 5 | 400 | 0h 3m 28s 69 | 0h 1m 4s 351 | 4.65E-4 | 0.34 |
| 10 | 10 | 0h 0m 42s 654 | 0h 0m 49s 785 | 7.98E-4 | 0.23 |
| 10 | 100 | 0h 1m 19s 405 | 0h 0m 53s 646 | 4.45E-4 | 0.23 |
| 10 | 200 | 0h 2m 6s 375 | 0h 0m 56s 89 | 4.65E-4 | 0.23 |
| 10 | 400 | 0h 3m 33s 253 | 0h 1m 8s 29 | 4.45E-4 | 0.23 |
| 20 | 10 |  |  |  |  |
| 20 | 100 | 0h 2m 21s 762 | 0h 1m 23s 883 | 4.04E-4 | 0.23 |
| 20 | 200 | 0h 2m 32s 952 | 0h 1m 22s 12 | 4.45E-4 | 0.23 |
| 20 | 400 | 0h 4m 4s 487 | 0h 1m 31s 248 | 4.25E-4 | 0.23 |
| 50 | 10 |  |  |  |  |
| 50 | 100 | 0h 3m 15s 485 | 0h 2m 53s 70 | 4.25E-4 | 0.23 |
| 50 | 200 | 0h 4m 2s 509 | 0h 2m 51s 733 | 4.45E-4 | 0.23 |
| 50 | 400 | 0h 5m 27s 252 | 0h 3m 7s 542 | 4.25E-4 | 0.23 |
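
To make the build times in the table easier to compare, here is a small sketch that converts them to milliseconds and computes the in-mem/partial ratio. It assumes the trailing number in each time (e.g. the "883" in "0h 3m 5s 883") is milliseconds; the helper names parse_elapsed and speedup are made up for this illustration and are not part of the patch.

```python
import re

# Matches strings such as "0h 3m 5s 883".
_ELAPSED = re.compile(r"(\d+)h\s+(\d+)m\s+(\d+)s\s+(\d+)")


def parse_elapsed(text):
    """Convert an elapsed-time string to milliseconds.

    Assumption: the last field is milliseconds.
    """
    match = _ELAPSED.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized elapsed time: {text!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def speedup(in_mem, partial):
    """In-mem build time divided by partial build time.

    A value above 1 means the partial build finished faster.
    """
    return parse_elapsed(in_mem) / parse_elapsed(partial)


if __name__ == "__main__":
    # Row "2 map tasks, 400 trees" from the KDD 1% table.
    print(round(speedup("0h 3m 5s 883", "0h 1m 43s 852"), 2))
```

A ratio below 1 (as in the 10-tree rows) means the in-mem build was the faster one for that configuration.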


> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_2.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."
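
The idea quoted above can be sketched roughly as follows. This is only an illustration, not the code in the attached patch: it assumes the data is cut into contiguous partitions (one per map task), that trees are assigned to partitions round-robin, and it uses a one-split stump on a single random feature as a stand-in for a real decision tree. All function names here are hypothetical.

```python
import random
from collections import Counter


def build_stump(rows, rng):
    """Stand-in for a real tree: one split on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0][0]))
    threshold = rng.choice(rows)[0][feature]
    left = Counter(label for x, label in rows if x[feature] <= threshold)
    right = Counter(label for x, label in rows if x[feature] > threshold)
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else majority
    right_label = right.most_common(1)[0][0] if right else majority
    return feature, threshold, left_label, right_label


def classify(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def build_partial_forest(rows, num_partitions, num_trees, seed=1):
    """Build each tree from only one portion of the data.

    rows is a list of (feature_tuple, label) pairs. Assumption: partitions
    are contiguous slices and trees are spread over them round-robin.
    """
    rng = random.Random(seed)
    size = -(-len(rows) // num_partitions)  # ceiling division
    partitions = [rows[i:i + size] for i in range(0, len(rows), size)]
    forest = []
    for t in range(num_trees):
        part = partitions[t % len(partitions)]
        # Bootstrap sample drawn from this partition only.
        bag = [rng.choice(part) for _ in part]
        forest.append(build_stump(bag, rng))
    return forest


def predict(forest, x):
    """Majority vote over all trees, whatever partition they came from."""
    votes = Counter(classify(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]
```

Since no tree ever sees the whole dataset, a partition whose class mix differs from the rest will produce trees that vote accordingly, which is the loss of "solidity" the quote refers to.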

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
