[
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deneche A. Hakim updated MAHOUT-145:
------------------------------------
Attachment: partial_August_2.patch
partial-mapred implementation
*changes*
* abstract class org.mahout.rf.mapred.Builder : Base class for Mapred Random
Forest builders. Takes care of storing the parameters common to the mapred
implementations: tree builder, data path, dataset path and seed. The child
classes must implement at least (a sketch of this contract follows this list):
** void configureJob(JobConf) : to further configure the job before it is
launched; and
** RandomForest parseOutput(JobConf, PredictionCallback) : to convert the job
outputs into a RandomForest and its corresponding oob predictions
* abstract class org.mahout.rf.mapred.MapredMapper : Base class for Mapred
mappers. Loads common parameters from the job
* org.mahout.rf.mapred.examples.BuildForest : can now build a forest using
either the in-mem implementation or the partial implementation (mapred or
sequential). It also has a special mode (-c command-line option) that checks
whether the mapred and sequential implementations produce the same results; I
use it to test the implementations because, when run from JUnit, Hadoop uses a
local runner with just one mapper.
* one important change concerns the Dataset class. This class describes the
data attributes. I added a tool (org.apache.mahout.rf.tools.Describe) that
takes a data path and a weird description string, then generates a Dataset and
stores it in a file. This file is then passed to the various builders, allowing
them to convert the data instances on the fly. For example, the KDD
description is "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" (I told you, it's weird!!!),
which means that (a small parsing sketch follows this list):
** the first attribute is Numerical
** the next 3 attributes are Categorical
** the next 2 attributes are Numerical
** ...
** the last attribute is the Label
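A minimal sketch of the Builder contract described above (only configureJob and
parseOutput come from the patch notes; the fields and the build() driver are
assumptions for illustration, and TreeBuilder, RandomForest and
PredictionCallback refer to classes from the patch):

{code:java}
package org.mahout.rf.mapred;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Sketch only: base class holding the parameters common to the mapred builders
// and driving the job; concrete builders fill in configureJob() and parseOutput().
public abstract class Builder {

  protected final TreeBuilder treeBuilder; // grows a single decision tree
  protected final Path dataPath;           // path to the data
  protected final Path datasetPath;        // path to the stored Dataset description
  protected final long seed;               // seed of the random number generator

  protected Builder(TreeBuilder treeBuilder, Path dataPath, Path datasetPath, long seed) {
    this.treeBuilder = treeBuilder;
    this.dataPath = dataPath;
    this.datasetPath = datasetPath;
    this.seed = seed;
  }

  /** Lets the concrete builder finish configuring the job before it is launched. */
  protected abstract void configureJob(JobConf job) throws IOException;

  /** Converts the job outputs into a RandomForest and feeds the oob predictions to the callback. */
  protected abstract RandomForest parseOutput(JobConf job, PredictionCallback callback)
      throws IOException;

  /** Common driver: configure the job, run it, then parse its output. */
  public RandomForest build(PredictionCallback callback) throws IOException {
    JobConf job = new JobConf(Builder.class);
    configureJob(job);
    JobClient.runJob(job);
    return parseOutput(job, callback);
  }
}
{code}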
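To make the description-string format concrete, here is a small self-contained
sketch (not the actual Describe tool; the class and enum names are made up for
illustration) that expands such a string into one attribute type per position:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for the "N 3 C 2 N ... L" format: an optional count
// followed by a type letter means "the next <count> attributes are of that
// type"; a bare letter stands for a single attribute.
public class DescribeFormatExample {

  enum AttrType { NUMERICAL, CATEGORICAL, LABEL }

  static List<AttrType> parse(String description) {
    List<AttrType> attrs = new ArrayList<AttrType>();
    int count = 1;
    for (String token : description.trim().split("\\s+")) {
      if (token.matches("\\d+")) {
        count = Integer.parseInt(token); // applies to the next type letter
      } else {
        AttrType type;
        if ("N".equals(token)) {
          type = AttrType.NUMERICAL;
        } else if ("C".equals(token)) {
          type = AttrType.CATEGORICAL;
        } else if ("L".equals(token)) {
          type = AttrType.LABEL;
        } else {
          throw new IllegalArgumentException("bad token: " + token);
        }
        for (int i = 0; i < count; i++) {
          attrs.add(type);
        }
        count = 1; // reset for the next bare letter
      }
    }
    return attrs;
  }

  public static void main(String[] args) {
    // The KDD description from above: 42 attributes, the last one being the label
    List<AttrType> attrs = parse("N 3 C 2 N C 4 N C 8 N 2 C 19 N L");
    System.out.println(attrs.size());                // 42
    System.out.println(attrs.get(0));                // NUMERICAL
    System.out.println(attrs.get(attrs.size() - 1)); // LABEL
  }
}
{code}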
package org.apache.mahout.rf.mapred.partial
* InterResults : Utility class that stores/loads the intermediate results
passed from the 1st to the 2nd step of the partial implementation
* PartialBuilder : inherits from Builder and builds the forest by splitting the
data across the mappers. Runs in two steps (a small sketch after this list
illustrates which classifications each step computes):
** in the first step each mapper receives a subset of the data as its input
split, builds a given number of trees, and returns each tree with the
classifications of the instances of the mapper's split that are oob for it;
** in the second step each mapper receives the trees generated by the first
step and computes, for each tree that does not belong to the mapper's
partition, the classifications of all the instances of the mapper's split.
PartialBuilder goes through the final step's results and passes the
classifications to a given PredictionCallback, allowing the calling code to
compute the oob error estimate.
* Step1Mapper : First step mapper. Builds the trees using the data available in
the InputSplit, and predicts the oob classes for each tree on its growing
partition (input split).
* PartialSequentialBuilder : Simulates the partial mapreduce implementation in
a sequential manner, useful for testing the implementation's performance
* Step2Job : 2nd step of the partial mapreduce builder. Computes the oob
predictions using all the trees of the forest
* Step2Mapper : Second step mapper. Using the trees of the first step, computes
the oob predictions for each tree, except the trees of its own partition, on
all the instances of its partition.
* TreeID : inherits from LongWritable, and allows combining a partition integer
and a treeId integer into a single LongWritable. Used by the first and second
steps to uniquely identify each tree of the forest and the partition it belongs
to (a packing sketch follows this list).
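To illustrate how the classification work is split between the two steps, a
small self-contained sketch (all names and the tree-to-partition mapping are
hypothetical) that enumerates what each step computes:

{code:java}
// Sketch only: which (tree, partition) classifications are produced by each
// step, for a hypothetical forest of numPartitions * treesPerPartition trees
// where trees are assigned to partitions in contiguous blocks.
public class PartialStepsExample {

  public static void main(String[] args) {
    int numPartitions = 3;     // number of mappers / data partitions
    int treesPerPartition = 2; // trees grown by each first-step mapper

    for (int partition = 0; partition < numPartitions; partition++) {
      for (int tree = 0; tree < numPartitions * treesPerPartition; tree++) {
        int growingPartition = tree / treesPerPartition; // partition that grew this tree
        if (growingPartition == partition) {
          // Step 1: the tree was grown on this partition, so only the instances
          // of the partition that are oob for this tree get classified.
          System.out.println("step 1: tree " + tree + " -> oob instances of partition " + partition);
        } else {
          // Step 2: the tree comes from another partition, so every instance of
          // this partition is oob for it and gets classified.
          System.out.println("step 2: tree " + tree + " -> all instances of partition " + partition);
        }
      }
    }
  }
}
{code}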
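The patch notes do not spell out how TreeID packs the two integers; as an
illustration only, here is one plausible scheme for combining them into the
single long held by a LongWritable:

{code:java}
import org.apache.hadoop.io.LongWritable;

// Illustrative only: the real TreeID may pack the two ids differently.
public class TreeIdExample extends LongWritable {

  public TreeIdExample() {
  }

  public TreeIdExample(int partition, int treeId) {
    // high 32 bits: partition, low 32 bits: treeId
    set(((long) partition << 32) | (treeId & 0xFFFFFFFFL));
  }

  public int partition() {
    return (int) (get() >>> 32);
  }

  public int treeId() {
    return (int) get(); // low 32 bits
  }

  public static void main(String[] args) {
    TreeIdExample id = new TreeIdExample(3, 17);
    System.out.println(id.partition() + " " + id.treeId()); // prints "3 17"
  }
}
{code}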
> PartialData mapreduce Random Forests
> ------------------------------------
>
> Key: MAHOUT-145
> URL: https://issues.apache.org/jira/browse/MAHOUT-145
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Reporter: Deneche A. Hakim
> Priority: Minor
> Attachments: partial_August_2.patch
>
>
> This implementation is based on a suggestion by Ted:
> "modify the original algorithm to build multiple trees for different portions
> of the data. That loses some of the solidity of the original method, but
> could actually do better if the splits exposed non-stationary behavior."