[jira] Commented: (MAHOUT-145) PartialData mapreduce Random Forests

Deneche A. Hakim (JIRA) Sun, 19 Jul 2009 04:10:39 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732990#action_12732990
 ]


Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------

In the partial implementation, the input of the program is the data and T 
(the number of trees). The data is split up between the mappers, but how many 
trees should each mapper build?

I've got two ideas:
* [easiest] each mapper builds T trees on its subset of the data. This makes it 
easy to configure how many trees each mapper builds, but it's somewhat tricky 
to estimate the total number of trees, because that depends on the number of 
splits returned by FileInputFormat.getSplits() (min split size, block size, 
data size...)
* each mapper builds T/M trees, where M is the number of mappers available. The 
user sets the total number of trees, and the number of trees that each mapper 
builds depends on the number of splits
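
To make the trade-off between the two options concrete, here is a minimal sketch in plain Java. It is not code from the patch: the class TreeAllocation and both method names are hypothetical, and the way the remainder is handed out when T is not divisible by M is an assumption, since the comment above only says "T/M".

```java
// Hypothetical helper, not part of Mahout: illustrates the two options only.
final class TreeAllocation {
    private TreeAllocation() {}

    /** Option 1: every mapper builds treesPerMapper trees, so the total grows with the number of splits. */
    static int totalTreesFixedPerMapper(int treesPerMapper, int numSplits) {
        if (treesPerMapper < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("invalid treesPerMapper or numSplits");
        }
        return treesPerMapper * numSplits;
    }

    /**
     * Option 2: the user fixes totalTrees and each partition builds about totalTrees / numSplits.
     * Assumption: the first (totalTrees % numSplits) partitions build one extra tree,
     * so that the per-partition counts add up to exactly totalTrees.
     */
    static int treesForPartition(int totalTrees, int numSplits, int partition) {
        if (totalTrees < 0 || numSplits <= 0 || partition < 0 || partition >= numSplits) {
            throw new IllegalArgumentException("invalid totalTrees, numSplits or partition");
        }
        int base = totalTrees / numSplits;
        int remainder = totalTrees % numSplits;
        return partition < remainder ? base + 1 : base;
    }
}
```

With the first option the total is only known once the number of splits is known. With the second the total is fixed up front, but each mapper has to learn the number of splits and its own partition index before it can decide how many trees to build.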

Any suggestions?

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."

