[
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740872#action_12740872
]
Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------
As expected, I found a bug and fixed it. I then launched another batch of
tests on my laptop:
|| Num Map Tasks || Num Trees || Partial oob error ||
| 2 | 100 | 0.043 |
| 2 | 400 | 0.033 |
| 10 | 100 | 0.051 |
| 10 | 400 | 0.051 |
| 50 | 100 | 0.43 |
| 50 | 400 | 0.43 |
As I said in a previous comment, Partial Builder uses two steps to complete its
job:
* In the first step, each mapper builds a number of trees using the subset of
data available in its partition. If there are P partitions then, because of the
bagging, each tree is built using about 2/(3 x P) of the data.
* Because all the instances that don't belong to a tree's partition can be
considered oob, a second step is used to complete the oob computation. Thus
each tree is tested against about 1 - 2/(3 x P) of the data.
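To make the two fractions above concrete, here is a minimal sketch of the arithmetic for the partition counts used in the tables. The function names are hypothetical and are not part of the patch. The 2/3 figure is the usual approximation for the fraction of distinct instances in a bootstrap sample, which tends to 1 - 1/e (roughly 0.632) as the sample grows.

```python
def partial_fractions(num_partitions):
    """Return (train_fraction, oob_fraction) for one tree.

    Uses the 2/(3 x P) approximation from the comment above:
    each tree sees about 2/3 of one partition out of P.
    """
    train = 2.0 / (3.0 * num_partitions)
    return train, 1.0 - train


def expected_unique_fraction(n):
    """Expected fraction of distinct instances in a bootstrap
    sample of size n drawn with replacement from n instances."""
    return 1.0 - (1.0 - 1.0 / n) ** n


if __name__ == "__main__":
    # Partition counts taken from the tables in this comment.
    for p in (2, 10, 50):
        train, oob = partial_fractions(p)
        print("P=%d: built on %.4f of the data, tested on %.4f" % (p, train, oob))
```

With 50 partitions each tree is built on a little over 1% of the data and tested on almost all of the rest, so the second step dominates the oob estimate.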
Using only the first step, I got the following results:
|| Num Map Tasks || Num Trees || Partial oob error ||
| 2 | 100 | 2.85E-4 |
| 2 | 400 | 2.67E-4 |
| 10 | 100 | 4.88E-4 |
| 10 | 400 | 2.81E-4 |
| 50 | 100 | 7.19E-4 |
| 50 | 400 | 5.46E-4 |
Although the second step passes the unit tests, there may still be a bug
hiding somewhere. I'm going to run the reference implementation on subsets of
the data and use the resulting forests to classify the whole data in the same
way Partial Builder does. This should confirm whether there is a bug or not.
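For the cross-check, one common way to compute an oob error is to let each instance be classified only by the trees that did not use it for training, take the majority vote, and count the disagreements with the true label. The sketch below assumes a hypothetical representation of a tree as a (predict, in_bag) pair. It is not the Mahout API and may differ from what Partial Builder actually does.

```python
from collections import Counter


def oob_error(labels, trees):
    """Majority-vote oob error.

    labels: list of true labels, indexed by instance.
    trees:  list of (predict, in_bag) pairs (hypothetical layout), where
            predict(i) returns the tree's label for instance i and
            in_bag is the set of instance indices used to build the tree.
    """
    errors = 0
    counted = 0
    for i, truth in enumerate(labels):
        # Only trees that never saw instance i are allowed to vote.
        votes = Counter(predict(i) for predict, in_bag in trees
                        if i not in in_bag)
        if not votes:
            continue  # in-bag for every tree, so no oob prediction
        counted += 1
        if votes.most_common(1)[0][0] != truth:
            errors += 1
    return errors / counted if counted else 0.0
```

For a tree built on one partition, in_bag would hold only the bagged indices from that partition, so every instance of the other partitions counts as oob, as in the second step described above.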
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Deneche A. Hakim
> Priority: Minor
> Attachments: partial_August_2.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."