[jira] Commented: (MAHOUT-145) PartialData mapreduce Random Forests

Deneche A. Hakim (JIRA) Tue, 11 Aug 2009 11:18:40 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742000#action_12742000
 ]


Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------

How the Partial Mapred builder works:
* step 0 (centralized): the main program prepares and launches the builder
* step 1 (mapred job): each mapper builds a set of trees and classifies the oob 
instances of its partition, then returns each tree along with its 
classifications of all the partition's instances (instances the tree did not 
classify get -1)
* step 1-2 (centralized): the builder processes the outputs of the job in two 
passes:
 ** the first pass computes the partitions' sizes and their respective order
 ** the second pass extracts the trees and passes the oob classifications to a 
callback
 this step is split in two to avoid loading all the outputs in memory (which 
slows down the program when the data is large)
* step 2 (mapred job): each mapper uses all the trees of the other partitions 
to compute the classifications for all the instances of its partition. This 
completes the oob error computation
* step 2-2 (centralized): the builder processes the outputs and passes the oob 
classifications to a callback
* step 3 (centralized): the main program receives the decision forest, and its 
callback has received all the oob classifications. In order to compute the oob 
error it must compare the oob classifications with the real data labels. 
Currently this is done by loading the whole dataset in memory (ouch!), 
extracting its labels, then computing the oob error
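
The callback and the final comparison in step 3 can be sketched roughly as follows. This is a minimal Python illustration, not the code from the patch: the names `OobVotes`, `prediction`, `feed` and `oob_error` are hypothetical, and combining the trees' classifications by majority vote is an assumption (the usual random forest rule). Only the -1 marker for non-classified instances comes from the description above.

```python
from collections import Counter

NOT_CLASSIFIED = -1  # marker described above: the tree did not classify this instance


class OobVotes:
    """Hypothetical callback that accumulates one vote counter per instance."""

    def __init__(self, num_instances):
        self.votes = [Counter() for _ in range(num_instances)]

    def prediction(self, instance_id, label):
        # Skip the marker so that non-classified instances add no vote.
        if label != NOT_CLASSIFIED:
            self.votes[instance_id][label] += 1

    def oob_error(self, true_labels):
        """Fraction of voted-on instances whose majority label is wrong (assumed rule)."""
        wrong = 0
        counted = 0
        for votes, truth in zip(self.votes, true_labels):
            if not votes:
                continue  # no tree voted on this instance
            counted += 1
            if votes.most_common(1)[0][0] != truth:
                wrong += 1
        return wrong / counted if counted else 0.0


def feed(callback, first_id, tree_outputs):
    """Pass one partition's outputs to the callback.

    tree_outputs holds one list of labels per tree, covering every instance
    of the partition; first_id is the global index of the partition's first
    instance (what the first pass of step 1-2 would have to provide).
    """
    for labels in tree_outputs:
        for offset, label in enumerate(labels):
            callback.prediction(first_id + offset, label)
```

For example, with two partitions of two instances each, `feed(callback, 0, [[1, -1], [1, 0]])` followed by `feed(callback, 2, [[-1, 0], [1, 0]])` fills the votes, and `callback.oob_error([1, 1, 1, 0])` then compares them with the real labels, which is the part that currently needs the whole dataset in memory.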

In the test results, the build time is the time taken by steps 1, 1-2, 2 and 
2-2. Although step 3 is not counted in the build time, it slows the tests so 
much that I was not able to try KDD 100%.

In the following results, the build time is computed by the program, and I was 
able to figure out the other times using the log of the program.

EC2 10 nodes (c1.medium) cluster
mapred.tasktracker.map.tasks.maximum=2
mapred.child.java.opts=-Xms500m -Xmx1000m
export HADOOP_HEAPSIZE=2000

seed 1, m 1, oob

KDD 10%
|| Num Map Tasks || Num Trees || Oob Error || Build Time || Step 1 || Step 1-2 || Step 2 || Step 2-2 || Step 3 ||
| 10 | 100 | 0.0515 | 0h 0m 48s 823 | 24s | 2s | 15s | 7s | 14s |
| 10 | 200 | 0.0514 | 0h 0m 59s 34 | 27s | 3s | 15s | 14s | 13s |
| 10 | 400 | 0.0513 | 0h 1m 40s 265 | 43s | 7s | 22s | 28s | 13s |
| 20 | 100 | 0.0864 | 0h 0m 37s 366 | 15s | 1s | 14s | 7s | 14s |
| 20 | 200 | 0.1024 | 0h 0m 47s 213 | 14s | 2s | 17s | 14s | 13s |
| 20 | 400 | 0.0903 | 0h 1m 14s 368 | 18s | 4s | 22s | 30s | 13s |
| 50 | 100 | 0.4315 | 0h 0m 37s 657 | 13s | 1s | 16s | 8s | 14s |
| 50 | 200 | 0.4316 | 0h 0m 48s 611 | 15s | 2s | 16s | 15s | 14s |
| 50 | 400 | 0.4316 | 0h 1m 6s 160 | 14s | 2s | 21s | 30s | 12s |
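
As a quick sanity check on the table, the build time of a row can be compared with the sum of its step 1, 1-2, 2 and 2-2 columns. The sketch below is only an illustration in Python: the helper names are hypothetical, and reading the last field of the build time (for example the `823` in `0h 0m 48s 823`) as milliseconds is an assumption.

```python
import re


def build_time_seconds(text):
    """Parse a build time such as '0h 0m 48s 823'.

    The last field is assumed to be milliseconds.
    """
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s (\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized build time: {text!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def split_row(row):
    """Split one '| a | b | ... |' table row into stripped cells."""
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


def step_seconds(cells):
    """Sum step durations written like '24s'."""
    return sum(int(cell.rstrip("s")) for cell in cells)


# First row of the KDD 10% table.
cells = split_row("| 10 | 100 | 0.0515 | 0h 0m 48s 823 | 24s | 2s | 15s | 7s | 14s |")
build = build_time_seconds(cells[3])
steps = step_seconds(cells[4:8])  # Step 1, 1-2, 2 and 2-2; Step 3 is left out
print(f"build time {build:.3f}s, sum of steps {steps}s")
```

Since the step times were read from the log with one-second resolution, a difference of about a second between the two values is to be expected.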

As soon as I compile the results of KDD50 and KDD100 I'll post them, then I can 
start explaining those results (at least I will try)

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_10.patch, partial_August_2.patch, 
> partial_August_9.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
