[
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732990#action_12732990
]
Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------
In the partial implementation, the inputs of the program are the data and T
(the number of trees). The data is split up between the mappers, but how many
trees should each mapper build?
I've got two ideas:
* [easiest] each mapper builds T trees on its subset of the data. This makes it
easy to configure how many trees each mapper builds, but it's somewhat tricky to
estimate the total number of trees because it will depend on
FileInputFormat.getSplits() (min split size, block size, data size...)
* each mapper builds T/M trees where M is the number of mappers available. The
user sets the total number of trees, and the number of trees that each mapper
builds will depend on the number of splits
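To make the trade-off concrete, here is a minimal sketch of the arithmetic
behind both ideas. It is plain Java with no Hadoop dependency; the class and
method names (TreeAllocation, totalTreesIfEachBuildsT, treesPerMapper) are
hypothetical, and handing the remainder of T/M to the first mappers is only one
possible convention, not something decided in this issue.

```java
import java.util.Arrays;

public class TreeAllocation {

    // Idea 1: every mapper builds T trees, so the forest size is T * M
    // and changes whenever the number of splits (mappers) changes.
    static int totalTreesIfEachBuildsT(int treesPerMapper, int numMappers) {
        if (treesPerMapper < 0 || numMappers <= 0) {
            throw new IllegalArgumentException("need treesPerMapper >= 0 and numMappers > 0");
        }
        return treesPerMapper * numMappers;
    }

    // Idea 2: the user fixes the total T and each mapper builds about T/M trees.
    // Assumption: when T is not divisible by M, the first (T % M) mappers
    // build one extra tree so that the counts still sum to T.
    static int[] treesPerMapper(int totalTrees, int numMappers) {
        if (totalTrees < 0 || numMappers <= 0) {
            throw new IllegalArgumentException("need totalTrees >= 0 and numMappers > 0");
        }
        int base = totalTrees / numMappers;
        int remainder = totalTrees % numMappers;
        int[] counts = new int[numMappers];
        for (int i = 0; i < numMappers; i++) {
            counts[i] = base + (i < remainder ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        int t = 100; // T, the number of trees given by the user
        int m = 8;   // M, the number of mappers (one per input split)

        System.out.println("Idea 1, total trees: " + totalTreesIfEachBuildsT(t, m));
        System.out.println("Idea 2, trees per mapper: " + Arrays.toString(treesPerMapper(t, m)));
    }
}
```

With T = 100 and M = 8, the first idea yields 800 trees in total, while the
second keeps the total at 100 and gives four mappers 13 trees and the other
four 12.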
Any suggestions?
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Deneche A. Hakim
> Priority: Minor
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."