[ https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730129#action_12730129 ]
Ted Dunning commented on MAHOUT-145:
------------------------------------
What do you think about using a normal mapper structure in which the map()
method reads one line at a time and stores the record in memory, and the tree
building then happens in the close() method of your mapper?
This trick is used extensively in streaming. If you are using 0.18.*, you
have to stash the output collector in an instance variable so that you can
produce output from close() (or just open a task-specific output file). In
0.20, I think the new API avoids that need by passing the Context argument to
cleanup(), its counterpart to close(). Because producing output in close() is
so important to some applications, you are guaranteed to be able to use the
output collector there.
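A rough sketch of the pattern, in case it helps. To keep it runnable without
Hadoop on the classpath, it uses a hypothetical Collector interface as a
stand-in for the real output collector, and BufferingMapper is an illustrative
name, not anything in Mahout. The tree building is replaced by a placeholder
that just emits the number of buffered records.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Hadoop's output collector; illustration only. */
interface Collector<K, V> {
    void collect(K key, V value);
}

/**
 * Sketch of the "buffer in map(), do the work in close()" pattern.
 * Method names mirror the 0.18-style mapper lifecycle, but this class
 * does not depend on Hadoop.
 */
class BufferingMapper {
    private final List<String> records = new ArrayList<>();
    private Collector<String, Integer> output; // stashed so close() can emit

    /** Called once per input line: remember the record and the collector. */
    public void map(long offset, String line, Collector<String, Integer> out) {
        this.output = out;
        records.add(line);
    }

    /** Called once after the last record: process the whole partition. */
    public void close() {
        if (output == null) {
            return; // map() was never called, so there is nothing to emit
        }
        // Placeholder for tree building over `records`: emit only a count.
        output.collect("records-in-partition", records.size());
    }
}

class BufferingMapperDemo {
    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        Collector<String, Integer> out = (k, v) -> emitted.add(k + "=" + v);

        BufferingMapper mapper = new BufferingMapper();
        String[] lines = {"1,0,yes", "0,1,no", "1,1,yes"};
        for (int i = 0; i < lines.length; i++) {
            mapper.map(i, lines[i], out);
        }
        mapper.close();

        System.out.println(emitted); // prints [records-in-partition=3]
    }
}
```

The obvious cost is that the whole partition has to fit in the mapper's
memory, so the split size bounds how much data each tree can see.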
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Deneche A. Hakim
> Priority: Minor
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."