[ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730065#action_12730065
 ] 

Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------

A possible implementation is as follows:

* Use a custom *InputFormat*, similar to *TextInputFormat*, that returns all the 
lines of a split at once in a *Text* or, better, in a custom *Writable* that 
holds a String[].
* The mapper simply converts the input lines to a *Data* instance and uses the 
reference implementation to build a tree.
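
To make the "all the lines of a split at once" idea concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class and method names (SplitLinesSketch, readSplit) are hypothetical, and the in-memory byte[] stands in for the input file; a real version would live inside the custom *RecordReader*. The sketch assumes the same ownership rule that line-oriented readers commonly use: a line belongs to the split in which its first byte falls.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the custom RecordReader: it returns every line
// owned by one split as a single String[] instead of one record per line.
final class SplitLinesSketch {

    // data   : the whole file content (assumption: '\n' line endings, UTF-8)
    // start  : byte offset where the split begins
    // length : number of bytes in the split
    static String[] readSplit(byte[] data, int start, int length) {
        int end = Math.min(data.length, start + length);
        int pos = start;

        // If the split begins in the middle of a line, that line belongs to
        // the previous split, so skip ahead to the start of the next line.
        if (pos > 0 && data[pos - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }

        List<String> lines = new ArrayList<>();
        // Read every line that starts inside the split, even if its last
        // bytes lie beyond the end of the split.
        while (pos < end) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines.toArray(new String[0]);
    }
}
```

With this rule, adjacent splits never share a line and never drop one, so each mapper receives a disjoint set of instances from which to build its *Data*.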

The custom *InputFormat* can either be a specialized *NLineInputFormat* with a 
custom *RecordReader* that returns all the lines of a split at once, or inherit 
directly from *FileInputFormat* and use the same custom *RecordReader*.
The advantage of inheriting from *NLineInputFormat* is that it is easy to 
configure the number of lines (instances) used to grow each tree. The drawback 
is that it reads all the data when generating the splits, which can slow down 
the implementation because the splits are generated on the client machine.
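
The cost difference between the two options can be illustrated with a small plain-Java sketch. The names below (SplitPlanningSketch, splitsBySize, splitsByLineCount) are hypothetical and do not mirror Hadoop's actual code; the sketch only assumes that size-based splitting needs nothing but the file length, while line-count splitting has to scan every byte to find line boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the two ways to plan splits on the client.
// Each split is returned as {startOffset, lengthInBytes}.
final class SplitPlanningSketch {

    // Size-based planning: pure arithmetic on the file length, so the
    // client never opens the file.
    static List<long[]> splitsBySize(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return splits;
    }

    // Line-count planning: every byte is inspected to locate newlines,
    // so the client reads the whole file before any mapper starts.
    static List<long[]> splitsByLineCount(byte[] data, int linesPerSplit) {
        List<long[]> splits = new ArrayList<>();
        long start = 0;
        int lines = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n' && ++lines == linesPerSplit) {
                splits.add(new long[] {start, i + 1 - start});
                start = i + 1;
                lines = 0;
            }
        }
        // Remaining lines that do not fill a complete group.
        if (start < data.length) {
            splits.add(new long[] {start, data.length - start});
        }
        return splits;
    }
}
```

The line-count variant gives exact control over how many instances each tree sees, whereas the size-based variant only controls the instance count approximately, through the split size.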


> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
