[
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739386#action_12739386
]
Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------
I'm running some tests to compare the *in-mem* and *partial*
implementations. Here are the first results from my laptop (Hadoop 0.19.1 in
pseudo-distributed mode on a dual-core processor):
All the tests use a random seed of 1, and only one random feature is
selected at a time.
KDD 1%
|| Num Map Tasks || Num trees || In-Mem build time || Partial build time || In-Mem oob error || Partial oob error ||
| 2 | 10 | 0h 0m 21s 5 | 0h 0m 31s 823 | 8.38E-4 | 0.43 |
| 2 | 100 | 0h 0m 57s 641 | 0h 0m 44s 43 | 4.45E-4 | 0.42 |
| 2 | 200 | 0h 1m 38s 307 | 0h 1m 4s 523 | 4.45E-4 | 0.43 |
| 2 | 400 | 0h 3m 5s 883 | 0h 1m 43s 852 | 4.65E-4 | 0.42 |
| 5 | 10 | 0h 0m 28s 404 | 0h 0m 33s 374 | 8.38E-4 | 0.32 |
| 5 | 100 | 0h 1m 12s 260 | 0h 0m 43s 628 | 4.65E-4 | 0.34 |
| 5 | 200 | 0h 2m 0s 293 | 0h 0m 47s 994 | 4.45E-4 | 0.34 |
| 5 | 400 | 0h 3m 28s 69 | 0h 1m 4s 351 | 4.65E-4 | 0.34 |
| 10 | 10 | 0h 0m 42s 654 | 0h 0m 49s 785 | 7.98E-4 | 0.23 |
| 10 | 100 | 0h 1m 19s 405 | 0h 0m 53s 646 | 4.45E-4 | 0.23 |
| 10 | 200 | 0h 2m 6s 375 | 0h 0m 56s 89 | 4.65E-4 | 0.23 |
| 10 | 400 | 0h 3m 33s 253 | 0h 1m 8s 29 | 4.45E-4 | 0.23 |
| 20 | 10 | | | | |
| 20 | 100 | 0h 2m 21s 762 | 0h 1m 23s 883 | 4.04E-4 | 0.23 |
| 20 | 200 | 0h 2m 32s 952 | 0h 1m 22s 12 | 4.45E-4 | 0.23 |
| 20 | 400 | 0h 4m 4s 487 | 0h 1m 31s 248 | 4.25E-4 | 0.23 |
| 50 | 10 | | | | |
| 50 | 100 | 0h 3m 15s 485 | 0h 2m 53s 70 | 4.25E-4 | 0.23 |
| 50 | 200 | 0h 4m 2s 509 | 0h 2m 51s 733 | 4.45E-4 | 0.23 |
| 50 | 400 | 0h 5m 27s 252 | 0h 3m 7s 542 | 4.25E-4 | 0.23 |
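To compare the two build-time columns numerically, the elapsed strings can be converted to milliseconds. The following is only a sketch: it assumes the fourth number in each cell (e.g. the `307` in `0h 1m 38s 307`) is milliseconds, which the table does not state explicitly, and `BuildTimes`, `parseMillis` and `speedup` are hypothetical names that are not part of the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BuildTimes {
    // Assumed layout: "<hours>h <minutes>m <seconds>s <milliseconds>".
    private static final Pattern ELAPSED =
        Pattern.compile("(\\d+)h (\\d+)m (\\d+)s (\\d+)");

    /** Converts one table cell such as "0h 1m 38s 307" to milliseconds. */
    static long parseMillis(String text) {
        Matcher m = ELAPSED.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized elapsed time: " + text);
        }
        long hours = Long.parseLong(m.group(1));
        long minutes = Long.parseLong(m.group(2));
        long seconds = Long.parseLong(m.group(3));
        long millis = Long.parseLong(m.group(4));
        return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
    }

    /** Ratio of in-mem build time to partial build time (above 1 means partial was faster). */
    static double speedup(String inMem, String partial) {
        return (double) parseMillis(inMem) / parseMillis(partial);
    }
}
```

For example, `BuildTimes.speedup("0h 3m 5s 883", "0h 1m 43s 852")` compares the two build times of the row with 2 map tasks and 400 trees.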
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Deneche A. Hakim
> Priority: Minor
> Attachments: partial_August_2.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."
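As a rough illustration of the suggestion quoted above, the sketch below splits the rows into contiguous portions and draws each bootstrap sample only from one portion, so a tree built on that sample never sees the rest of the data. This is an assumption about how such a scheme could look, not the code in the attached patch; `PartialBagging`, `partitionRange` and `bootstrap` are hypothetical names.

```java
import java.util.Random;

final class PartialBagging {
    /** Returns {start, endExclusive} of the rows that partition p would see. */
    static int[] partitionRange(int numRows, int numPartitions, int p) {
        int base = numRows / numPartitions;
        int extra = numRows % numPartitions; // the first 'extra' partitions get one more row
        int start = p * base + Math.min(p, extra);
        int size = base + (p < extra ? 1 : 0);
        return new int[] {start, start + size};
    }

    /** Bootstrap sample (with replacement) drawn only from the rows of one partition. */
    static int[] bootstrap(int[] range, Random rng) {
        int size = range[1] - range[0];
        int[] sample = new int[size];
        for (int i = 0; i < size; i++) {
            sample[i] = range[0] + rng.nextInt(size);
        }
        return sample;
    }
}
```

With 10 rows and 3 partitions, `partitionRange(10, 3, 1)` covers rows 4 to 6, and every index returned by `bootstrap` for that range stays inside it.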