[GitHub] spark pull request: [SPARK-4434] [MLlib] add python api for random...

davies Mon, 17 Nov 2014 11:24:01 -0800

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/3320


    [SPARK-4434] [MLlib] add python api for random forest

    ```
        class WeightedEnsembleModel
         |  A model trained by RandomForest
         |
         |  numWeakHypotheses(self)
         |      Get number of trees in forest.
         |
         |  predict(self, x)
         |      Predict values for a single data point or an RDD of points using the trained model.
         |
         |  toDebugString(self)
         |      Full model, printed to a string.
         |
         |  totalNumNodes(self)
         |      Get total number of nodes, summed over all trees in the forest.
         |
    
        class RandomForest
         |  trainClassifier(cls, data, numClassesForClassification, 
categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', 
impurity='gini', maxDepth=4, maxBins=32, seed=None):
         |      Method to train a random forest model for binary or multiclass classification.
         |
         |      :param data: Training dataset: RDD of LabeledPoint.
         |                   Labels should take values {0, 1, ..., 
numClasses-1}.
         |      :param numClassesForClassification: number of classes for 
classification.
         |      :param categoricalFeaturesInfo: Map storing arity of 
categorical features.
         |                                  E.g., an entry (n -> k) indicates 
that feature n is categorical
         |                                  with k categories indexed from 0: 
{0, 1, ..., k-1}.
         |      :param numTrees: Number of trees in the random forest.
         |      :param featureSubsetStrategy: Number of features to consider 
for splits at each node.
         |                                Supported: "auto" (default), "all", 
"sqrt", "log2", "onethird".
         |                                If "auto" is set, this parameter is 
set based on numTrees:
         |                                  if numTrees == 1, set to "all";
         |                                  if numTrees > 1 (forest) set to 
"sqrt" for classification and
         |                                    to "onethird" for regression.
         |      :param impurity: Criterion used for information gain 
calculation.
         |                   Supported values: "gini" (recommended) or 
"entropy".
         |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 
1 leaf node; depth 1 means
         |                       1 internal node + 2 leaf nodes. (default: 4)
         |      :param maxBins: Maximum number of bins used for splitting features. (default: 32)
         |      :param seed:  Random seed for bootstrapping and choosing 
feature subsets.
         |      :return: WeightedEnsembleModel that can be used for prediction
         |
         |   trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, 
featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, 
seed=None):
         |      Method to train a random forest model for regression.
         |
         |      :param data: Training dataset: RDD of LabeledPoint.
         |                   Labels are real numbers.
         |      :param categoricalFeaturesInfo: Map storing arity of 
categorical features.
         |                                   E.g., an entry (n -> k) indicates 
that feature n is categorical
         |                                   with k categories indexed from 0: 
{0, 1, ..., k-1}.
         |      :param numTrees: Number of trees in the random forest.
         |      :param featureSubsetStrategy: Number of features to consider 
for splits at each node.
         |                                 Supported: "auto" (default), "all", 
"sqrt", "log2", "onethird".
         |                                 If "auto" is set, this parameter is 
set based on numTrees:
         |                                 if numTrees == 1, set to "all";
         |                                 if numTrees > 1 (forest) set to 
"sqrt" for classification and
         |                                 to "onethird" for regression.
         |      :param impurity: Criterion used for information gain 
calculation.
         |                       Supported values: "variance".
         |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 
1 leaf node; depth 1 means
         |                       1 internal node + 2 leaf nodes. (default: 4)
         |      :param maxBins: Maximum number of bins used for splitting features. (default: 32)
         |      :param seed:  Random seed for bootstrapping and choosing 
feature subsets.
         |      :return: WeightedEnsembleModel that can be used for prediction
         |
    ```
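
The "auto" rule for featureSubsetStrategy and the maxDepth examples in the docstrings above can be sketched in plain Python. This is only an illustration of the documented behaviour, not code from the patch: the helper names `resolve_feature_subset_strategy` and `max_nodes_for_depth` are hypothetical, and the node bound assumes every tree is binary, which matches the "depth 1 means 1 internal node + 2 leaf nodes" example.

```python
# Strategy names listed as supported in the docstring.
SUPPORTED_STRATEGIES = ("auto", "all", "sqrt", "log2", "onethird")


def resolve_feature_subset_strategy(strategy, num_trees, is_classification):
    """Hypothetical helper (not part of the PR): map "auto" to a concrete name."""
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError("unsupported featureSubsetStrategy: %r" % (strategy,))
    if strategy != "auto":
        # An explicit strategy is passed through unchanged.
        return strategy
    if num_trees == 1:
        # A single tree considers all features at each node.
        return "all"
    # A forest uses "sqrt" for classification and "onethird" for regression.
    return "sqrt" if is_classification else "onethird"


def max_nodes_for_depth(max_depth):
    """Hypothetical helper: upper bound on nodes in one tree of this depth.

    Assumes binary trees, so depth 0 gives 1 node and depth 1 gives 3 nodes.
    """
    if max_depth < 0:
        raise ValueError("max_depth must be non-negative")
    return 2 ** (max_depth + 1) - 1


if __name__ == "__main__":
    print(resolve_feature_subset_strategy("auto", 10, True))
    print(max_nodes_for_depth(4))
```

With the default maxDepth=4, the bound works out to at most 31 nodes per tree, so totalNumNodes() for a forest of numTrees trees would not exceed 31 * numTrees under the same assumption.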

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark forest

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3320.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3320
    
----
commit 565d47627953bd5e420b81d48a9a80afe4e6f66b
Author: Davies Liu <[email protected]>
Date:   2014-11-17T19:18:07Z

    add python api for random forest

----

