[
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742000#action_12742000
]
Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------
How the Partial Mapred builder works:
* step 0 (centralized): the main program prepares and launches the builder
* step 1 (mapred job): each mapper builds a set of trees and classifies the oob
instances of its partition, then returns each tree with the classifications of
all partition instances (unclassified instances get -1)
* step 1-2 (centralized): the builder processes the outputs of the job twice:
** the first time in order to compute the partitions' sizes and their
respective order
** the second time to extract the trees and pass the oob classifications to a
callback
This step is split into two passes to avoid loading all the outputs into memory,
which slows down the program when the data is large
* step 2 (mapred job): each mapper uses all the trees of the other partitions
to compute the classifications for all the instances of its partition. This
completes the oob error computation
* step 2-2 (centralized): the builder processes the outputs and passes the oob
classifications to a callback
* step 3 (centralized): the main program receives the decision forest, and its
callback has received all the oob classifications. In order to compute the oob
error it must compare the oob classifications with the real data labels.
Currently this is done by loading the whole dataset into memory (ouch!),
extracting its labels, then computing the oob error
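The comparison in step 3 can be sketched as follows. This is a minimal Python illustration, not the Mahout code from the patch: the function name `oob_error`, the list-of-lists input layout, and the use of a majority vote to combine the oob classifications of several trees are all assumptions. Only the -1 marker for unclassified instances comes from the description above.

```python
from collections import Counter

# Marker for "this tree did not classify this instance" (from the description).
UNCLASSIFIED = -1


def oob_error(predictions_per_tree, labels):
    """Estimate the oob error from per-tree oob classifications.

    predictions_per_tree: one list per tree, holding a predicted label
        (or UNCLASSIFIED) for every instance. Hypothetical layout.
    labels: the real label of every instance, in the same order.
    """
    # Collect the oob votes of every instance, ignoring the -1 markers.
    votes = [Counter() for _ in labels]
    for tree_predictions in predictions_per_tree:
        for index, predicted in enumerate(tree_predictions):
            if predicted != UNCLASSIFIED:
                votes[index][predicted] += 1

    errors = 0
    counted = 0
    for instance_votes, label in zip(votes, labels):
        if not instance_votes:
            continue  # no tree classified this instance, so skip it
        counted += 1
        # Assumption: the oob prediction is the majority vote of the trees.
        majority, _ = instance_votes.most_common(1)[0]
        if majority != label:
            errors += 1
    return errors / counted if counted else float("nan")
```

Because only the labels are needed here, a variant that reads labels one at a time instead of holding the whole dataset would avoid the memory cost mentioned in step 3.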
In the test results, the build time is the time taken by steps 1, 1-2, 2 and
2-2. Although step 3 is not included in the build time, it slows the tests so
much that I was not able to try KDD 100%.
In the following results, the build time is computed by the program, and I was
able to figure out the other times using the log of the program.
EC2 10 nodes (c1.medium) cluster
mapred.tasktracker.map.tasks.maximum=2
mapred.child.java.opts=-Xms500m -Xmx1000m
export HADOOP_HEAPSIZE=2000
seed 1, m 1, oob
KDD 10%
|| Num Map Tasks || Num Trees || Oob Error || Build Time || Step 1 || Step 1-2 || Step 2 || Step 2-2 || Step 3 ||
| 10 | 100 | 0.0515 | 0h 0m 48s 823 | 24s | 2s | 15s | 7s | 14s |
| 10 | 200 | 0.0514 | 0h 0m 59s 34 | 27s | 3s | 15s | 14s | 13s |
| 10 | 400 | 0.0513 | 0h 1m 40s 265 | 43s | 7s | 22s | 28s | 13s |
| 20 | 100 | 0.0864 | 0h 0m 37s 366 | 15s | 1s | 14s | 7s | 14s |
| 20 | 200 | 0.1024 | 0h 0m 47s 213 | 14s | 2s | 17s | 14s | 13s |
| 20 | 400 | 0.0903 | 0h 1m 14s 368 | 18s | 4s | 22s | 30s | 13s |
| 50 | 100 | 0.4315 | 0h 0m 37s 657 | 13s | 1s | 16s | 8s | 14s |
| 50 | 200 | 0.4316 | 0h 0m 48s 611 | 15s | 2s | 16s | 15s | 14s |
| 50 | 400 | 0.4316 | 0h 1m 6s 160 | 14s | 2s | 21s | 30s | 12s |
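
As a sanity check on the table, the build time should be close to the sum of steps 1, 1-2, 2 and 2-2. The Python sketch below checks that using the values from the rows above. The helper name `build_time_seconds` is hypothetical, and reading the last field of the build time as unpadded milliseconds is an assumption about the log format.

```python
import re


def build_time_seconds(text):
    """Convert a build time such as '0h 1m 40s 265' to seconds.

    Assumption: the last field is milliseconds, printed without padding.
    """
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s (\d+)", text.strip())
    if match is None:
        raise ValueError("unexpected build time format: %r" % text)
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


# (build time, [step 1, step 1-2, step 2, step 2-2]) copied from the table.
ROWS = [
    ("0h 0m 48s 823", [24, 2, 15, 7]),
    ("0h 0m 59s 34", [27, 3, 15, 14]),
    ("0h 1m 40s 265", [43, 7, 22, 28]),
    ("0h 0m 37s 366", [15, 1, 14, 7]),
    ("0h 0m 47s 213", [14, 2, 17, 14]),
    ("0h 1m 14s 368", [18, 4, 22, 30]),
    ("0h 0m 37s 657", [13, 1, 16, 8]),
    ("0h 0m 48s 611", [15, 2, 16, 15]),
    ("0h 1m 6s 160", [14, 2, 21, 30]),
]

# Difference between the reported build time and the sum of the four steps.
DIFFERENCES = [build_time_seconds(total) - sum(steps) for total, steps in ROWS]

if __name__ == "__main__":
    for (total, steps), diff in zip(ROWS, DIFFERENCES):
        print("%-14s steps sum to %3ds, difference %+.3fs" % (total, sum(steps), diff))
```

For these rows the two figures agree to within one second, which is consistent with the per-step times being whole seconds read from the log.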
As soon as I compile the results for KDD 50% and KDD 100% I'll post them, then I
can start explaining those results (at least I will try).
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Deneche A. Hakim
> Priority: Minor
> Attachments: partial_August_10.patch, partial_August_2.patch,
> partial_August_9.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."