[
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730065#action_12730065
]
Deneche A. Hakim commented on MAHOUT-145:
-----------------------------------------
A possible implementation is as follows:
* Use a custom *InputFormat*, similar to *TextInputFormat*, that returns all the
lines of a split at once in a *Text* or, better, in a custom *Writable* that
holds a String[].
* The mapper simply converts the input lines to a *Data* instance and uses the
reference implementation to build a tree.
The custom *InputFormat* can either be a specialized *NLineInputFormat* with a
custom *RecordReader* that returns all the lines of a split at once, or inherit
from *FileInputFormat* and use the same custom *RecordReader*.
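To make the "all the lines of a split at once" idea concrete, here is a rough
sketch in plain Java. It does not use the Hadoop API; the class name
WholeSplitReader and the method readSplit are hypothetical, and the boundary
rule (a split that does not start at offset 0 skips its first line, and a line
that starts at or before the end of the split belongs to that split) is my
assumption, modelled on how line-oriented readers usually treat split
boundaries. In a real job this logic would presumably live inside the custom
*RecordReader*.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Mahout.
public class WholeSplitReader {

    /** Returns every line that belongs to the split [start, start + length) as one record. */
    public static String[] readSplit(byte[] data, long start, long length) {
        int pos = (int) start;
        int end = (int) Math.min(data.length, start + length);

        // A split that does not begin at offset 0 skips its first (possibly partial)
        // line; the previous split is responsible for reading it.
        if (start != 0) {
            pos = nextLineStart(data, pos);
        }

        List<String> lines = new ArrayList<>();
        // A line starting at or before 'end' belongs to this split, even if it runs past 'end'.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            // Drop the trailing '\n' if the line has one.
            int lineEnd = (next > pos && data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines.toArray(new String[0]);
    }

    /** Index just past the next '\n' at or after 'from', or data.length if there is none. */
    private static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(data.length, i + 1);
    }
}
```

The mapper would then receive a single String[] per split and could convert it
to a *Data* instance in one call, instead of being invoked once per line.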
The advantage of inheriting from *NLineInputFormat* is that it is easy to
configure the number of lines (instances) used to grow each tree. The drawback
is that it reads all the data when generating the splits, which can slow down
the implementation because the splits are generated on the client machine.
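The cost difference between the two ways of generating splits can be sketched
in plain Java. This is only an illustration, not Hadoop code: the class
SplitPlanning and both methods are hypothetical, and each split is represented
as a {start, length} pair. Line-based splitting has to scan every byte to find
the line boundaries, while size-based splitting needs only the file length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Mahout.
public class SplitPlanning {

    /** Line-based splits: every byte is scanned to locate the line boundaries. */
    public static List<long[]> splitsByLines(byte[] data, int linesPerSplit) {
        List<long[]> splits = new ArrayList<>();
        long begin = 0;
        int linesSeen = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n' && ++linesSeen == linesPerSplit) {
                splits.add(new long[] {begin, i + 1 - begin});
                begin = i + 1;
                linesSeen = 0;
            }
        }
        if (begin < data.length) {
            // Remaining lines that do not fill a whole group.
            splits.add(new long[] {begin, data.length - begin});
        }
        return splits;
    }

    /** Size-based splits: computed from the file length alone, no data is read. */
    public static List<long[]> splitsBySize(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return splits;
    }
}
```

With the first method the client does work proportional to the size of the
data set before any mapper starts; with the second, the work is proportional
only to the number of splits.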
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Deneche A. Hakim
> Priority: Minor
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.