[ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730129#action_12730129
 ] 

Ted Dunning commented on MAHOUT-145:
------------------------------------


What do you think about using a normal mapper structure, where the map() 
method reads one line at a time and just stores the record in memory, and the 
tree building is then done in the close() method of your mapper?

This trick is used extensively in streaming.  If you are using 0.18.*, then you 
have to stash the output collector in an instance variable so that you can 
produce output from close() (or just open a task-specific output file).  In 
0.20, I think the new API passes the Context argument to the cleanup() method, 
which replaces close(), to avoid that need.  Because producing output at the 
end of a task is so important to some applications, you are guaranteed to be 
able to use the output collector in close().
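
To make the shape of this concrete, here is a rough sketch of the pattern in 
plain Java.  It does not use the Hadoop classes: Collector is a hypothetical 
stand-in for the 0.18 OutputCollector, BufferingMapper and its method 
signatures are simplified for illustration, and the "tree building" is only a 
placeholder that reports how many records were buffered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop 0.18's OutputCollector<K, V>.
interface Collector<K, V> {
    void collect(K key, V value);
}

// Sketch of the pattern: buffer records in map(), do the real work in close().
class BufferingMapper {
    private final List<String> records = new ArrayList<String>();
    private Collector<Integer, String> output; // stashed for use in close()

    // Called once per input line; it only stores the record.
    void map(long offset, String line, Collector<Integer, String> out) {
        output = out;
        records.add(line);
    }

    // Called once after the last record, when the whole split is in memory.
    void close() {
        if (output == null) {
            return; // map() was never called, so there is nothing to emit
        }
        // Placeholder for the tree building over 'records'.
        output.collect(records.size(),
                "tree built from " + records.size() + " records");
    }
}

class BufferingMapperDemo {
    public static void main(String[] args) {
        final List<String> emitted = new ArrayList<String>();
        Collector<Integer, String> out = new Collector<Integer, String>() {
            public void collect(Integer key, String value) {
                emitted.add(key + "\t" + value);
            }
        };
        BufferingMapper mapper = new BufferingMapper();
        mapper.map(0L, "record-1", out); // made-up input lines
        mapper.map(9L, "record-2", out);
        mapper.close();                  // output is produced only here
        System.out.println(emitted);
    }
}
```

The obvious cost is that every record of the split has to fit in the memory 
of the task, so the split size bounds how much data each tree can see.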

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
