[ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deneche A. Hakim updated MAHOUT-145:
------------------------------------

    Attachment: partial_August_2.patch

partial-mapred implementation

*changes*

* abstract class org.mahout.rf.mapred.Builder : Base class for Mapred Random 
Forest builders. Takes care of storing the parameters common to the mapred 
implementations: tree builder, data path, dataset path and seed. The child 
classes must implement at least :
 ** void configureJob(JobConf) : to further configure the job before its 
launch; and
 ** RandomForest parseOutput(JobConf, PredictionCallback) in order to convert 
the job outputs into a RandomForest and its corresponding oob predictions
* abstract class org.mahout.rf.mapred.MapredMapper : Base class for Mapred 
mappers. Loads common parameters from the job
* org.mahout.rf.mapred.examples.BuildForest : can now build a forest using 
either the in-mem or partial implementations (mapred or sequential).
  It also has a special mode (-c command-line option) that checks whether the 
mapred and sequential implementations produce the same results. I use it to 
test the implementations, because under JUnit Hadoop uses a local runner with 
just one mapper.
* one important change concerns the Dataset class. This class describes the 
data attributes. I added a tool (org.apache.mahout.rf.tools.Describe) that 
takes a data path and a weird description string, generates a Dataset from 
them, and stores it in a file. This file is then passed to the various 
builders, allowing them to convert the data instances on the fly. For example, 
the KDD description is : "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" (I told you, it's 
weird!!!), which means that :
 ** the first attribute is Numerical
 ** the 3 next attributes are Categorical
 ** the 2 next attributes are Numerical
 ** ...
 ** the last attribute is the Label
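
To make the description format concrete, here is a minimal sketch of how such 
a string could be expanded into one type per attribute. It is not the code of 
the Describe tool: the class name DescriptorSketch, the Kind enum and the 
parse method are hypothetical, and the grammar (an optional repeat count 
followed by N, C or L) is only inferred from the KDD example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the patch: expands a description string
// such as "N 3 C 2 N L" into one attribute kind per column.
final class DescriptorSketch {

  enum Kind { NUMERICAL, CATEGORICAL, LABEL }

  // Assumed grammar: tokens separated by whitespace; a number applies to the
  // type token that follows it; a type token without a number counts once.
  static List<Kind> parse(String descriptor) {
    List<Kind> kinds = new ArrayList<>();
    int repeat = 1;
    for (String token : descriptor.trim().split("\\s+")) {
      if (token.matches("\\d+")) {
        repeat = Integer.parseInt(token);
        continue;
      }
      Kind kind;
      switch (token) {
        case "N": kind = Kind.NUMERICAL; break;
        case "C": kind = Kind.CATEGORICAL; break;
        case "L": kind = Kind.LABEL; break;
        default: throw new IllegalArgumentException("Unknown token: " + token);
      }
      for (int i = 0; i < repeat; i++) {
        kinds.add(kind);
      }
      repeat = 1; // the count only applies to the next type token
    }
    return kinds;
  }

  public static void main(String[] args) {
    List<Kind> kinds = parse("N 3 C 2 N C 4 N C 8 N 2 C 19 N L");
    System.out.println(kinds.size() + " attributes, last is "
        + kinds.get(kinds.size() - 1));
  }
}
```

Under this reading, the KDD string expands to 42 entries: 41 attributes 
followed by the label.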

package org.apache.mahout.rf.mapred.partial
* InterResults : Utility class that stores/loads the intermediate results 
passed from the 1st to the 2nd step of the partial implementation
* PartialBuilder : inherits from Builder and builds the forest by splitting the 
data among the mappers. Runs in two steps:
 ** in the first step each mapper receives a subset of the data with its input 
split, builds a given number of trees, returning each tree with the 
classifications of the instances of the mapper's split that are oob;
 ** in the second step each mapper receives the trees generated by the first 
step and, for each tree that does not belong to the mapper's partition, 
computes the classifications of all the instances of the mapper's split.
 PartialBuilder goes through the final step results and passes the 
classifications to a given PredictionCallback, allowing the calling code to 
compute the oob error estimate.
* Step1Mapper : First step mapper. Builds the trees using the data available in 
the InputSplit. Predicts the oob classes for each tree within the partition 
(input split) it was grown on.
* PartialSequentialBuilder : Simulates the Partial mapreduce implementation in 
a sequential manner, useful when testing the performance of the implementation
* Step2Job : 2nd step of the partial mapreduce builder. Computes the oob 
predictions using all the trees of the forest
* Step2Mapper : Second step mapper. Using the trees of the first step, computes 
the oob predictions of each tree, except those of its own partition, on all 
instances of the partition.
* TreeID: inherits from LongWritable; combines a partition integer and a treeId 
integer into a single LongWritable. Used by the first and second steps to 
uniquely identify each tree of the forest and the partition to which it 
belongs.

> PartialData mapreduce Random Forests
> ------------------------------------
>
>                 Key: MAHOUT-145
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-145
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: partial_August_2.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions 
> of the data. That loses some of the solidity of the original method, but 
> could actually do better if the splits exposed non-stationary behavior."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
