[GitHub] spark pull request: [SPARK-6113] [mllib] [WIP] Stabilize DecisionT...

jkbradley Fri, 13 Mar 2015 00:53:27 -0700

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/5009


    [SPARK-6113] [mllib] [WIP] Stabilize DecisionTree and ensembles APIs

    This is a **WIP** PR for cleaning up and finalizing the DecisionTree and 
tree ensembles APIs.
    
    *Please discuss overall design on the JIRA and implementation details here 
in this PR.*
    
    ### Summary
    
    Here is the description copied from the JIRA:
    
    > **Issue**: The APIs for DecisionTree and ensembles (RandomForests and 
GradientBoostedTrees) have been experimental for a long time. The API has 
become very convoluted because trees and ensembles have many, many variants, 
some of which we have added incrementally without a long-term design.
    > **Proposal**: This JIRA is for discussing changes required to finalize 
the APIs. After we discuss, I will make a PR to update the APIs and make them 
non-Experimental. This will require making many breaking changes; see the 
design doc for details.
    > **[Design doc](https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4)**: 
This outlines current issues and the proposed API.
    
    Overall code layout:
    * The old API in mllib.tree.* will remain the same.
    * The new API will reside in mllib.classification.* and mllib.regression.*
    
    ### What to review
    
    Currently, the only items to review are the 2 examples and their 
demonstration of setting parameters and calling run():
    * NewDT.scala
    * JavaNewDT.java
    
    Note these 2 examples compile and run.  You may find it interesting to see 
how parameters are implemented in traits; this is based on the ml.param.Param 
implementation and the ongoing dev list discussion on enum-like types.  Please 
do NOT bother to comment on details or messy things yet.
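    
    For readers who have not seen parameters defined in traits, here is a minimal sketch of the general pattern. It is not the code from this PR or from ml.param.Param; the names HasMaxDepth, HasMinInstancesPerNode, and SketchTreeClassifier, along with the default values, are hypothetical and chosen only for illustration.
    
    ```scala
    // Hypothetical sketch: each parameter lives in its own trait, and
    // estimators mix in the traits they need.
    trait HasMaxDepth {
      private var maxDepthValue: Int = 5 // illustrative default, an assumption
    
      def getMaxDepth: Int = maxDepthValue
    
      // Returning this.type preserves the concrete class across chained setters.
      def setMaxDepth(value: Int): this.type = {
        require(value >= 0, s"maxDepth must be nonnegative but was $value")
        maxDepthValue = value
        this
      }
    }
    
    trait HasMinInstancesPerNode {
      private var minInstancesValue: Int = 1 // illustrative default, an assumption
    
      def getMinInstancesPerNode: Int = minInstancesValue
    
      def setMinInstancesPerNode(value: Int): this.type = {
        require(value >= 1, s"minInstancesPerNode must be positive but was $value")
        minInstancesValue = value
        this
      }
    }
    
    // A stand-in estimator that only collects parameters; it does no training.
    class SketchTreeClassifier extends HasMaxDepth with HasMinInstancesPerNode {
      def describe: String =
        s"maxDepth=$getMaxDepth, minInstancesPerNode=$getMinInstancesPerNode"
    }
    
    object ParamTraitDemo {
      def main(args: Array[String]): Unit = {
        val dt = new SketchTreeClassifier().setMaxDepth(3).setMinInstancesPerNode(10)
        println(dt.describe)
      }
    }
    ```
    
    Because each setter returns this.type, the chained calls keep the type SketchTreeClassifier, so setters from different traits can be mixed freely in one expression.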
    
    Current questions:
    * Should we rename "FeatureSubsetStrategy"?
    * Is "DecisionTreeClassificationModel" too long a name?
    * How should users select parameter options?  See examples, and also see 
doc images I will paste below:
      * RandomForestClassifier.featureSubsetStrategies.Auto/All/etc.
      * DecisionTreeClassifier.Entropy/Gini
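    
    To make the last question concrete, the following sketch shows one way options such as Entropy and Gini could be reached through the estimator itself. It is only an illustration under assumptions: Impurity, Impurities, and SketchDecisionTreeClassifier are hypothetical names, and the PR's actual types may differ.
    
    ```scala
    // Hypothetical enum-like type: a sealed trait with one case object per option.
    sealed trait Impurity { def name: String }
    
    object Impurities {
      case object Gini extends Impurity { val name = "gini" }
      case object Entropy extends Impurity { val name = "entropy" }
    }
    
    class SketchDecisionTreeClassifier {
      private var impurity: Impurity = SketchDecisionTreeClassifier.Gini
    
      def getImpurity: Impurity = impurity
    
      def setImpurity(value: Impurity): this.type = {
        impurity = value
        this
      }
    }
    
    object SketchDecisionTreeClassifier {
      // Aliases on the companion object, so importing the estimator is enough
      // to reach every option it accepts.
      val Gini: Impurity = Impurities.Gini
      val Entropy: Impurity = Impurities.Entropy
    }
    
    object OptionSelectionDemo {
      def main(args: Array[String]): Unit = {
        val dt = new SketchDecisionTreeClassifier()
          .setImpurity(SketchDecisionTreeClassifier.Entropy)
        println(dt.getImpurity.name)
      }
    }
    ```
    
    The trade-off in this style is discoverability versus duplication: users find the options next to the estimator, but every estimator that accepts the option has to repeat the aliases.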
    
    ### Notes
    
    #### FeatureSubsetStrategy options
    
    Another option for FeatureSubsetStrategy would be something like:
    ```
    import org.apache.spark.mllib.tree.FeatureSubsetStrategy
    ...
    rf.setFeatureSubsetStrategy(FeatureSubsetStrategy.Fraction(0.2))
    ```
    The main issue is that FeatureSubsetStrategy is shared across 
classification and regression, so we must either (a) import it from a shared 
subpackage like o.a.s.mllib.tree or (b) make it available in both 
o.a.s.mllib.classification and o.a.s.mllib.regression.
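    
    A rough sketch of how (a) and (b) could coexist is shown here. Plain objects named tree, classification, and regression stand in for the real packages so the snippet fits in one file; the Auto and All options come from the list above, while the (0, 1] range check on Fraction is an assumption made for illustration.
    
    ```scala
    // Stand-in for the shared subpackage in option (a).
    object tree {
      sealed trait FeatureSubsetStrategy
    
      object FeatureSubsetStrategy {
        case object Auto extends FeatureSubsetStrategy
        case object All extends FeatureSubsetStrategy
    
        // Assumed constraint: a fraction of the features in (0, 1].
        final case class Fraction(fraction: Double) extends FeatureSubsetStrategy {
          require(fraction > 0.0 && fraction <= 1.0,
            s"fraction must be in (0, 1] but was $fraction")
        }
      }
    }
    
    // Option (b): re-export the shared type and its companion under each API
    // namespace, so users never import from the shared location directly.
    object classification {
      type FeatureSubsetStrategy = tree.FeatureSubsetStrategy
      val FeatureSubsetStrategy = tree.FeatureSubsetStrategy
    }
    
    object regression {
      type FeatureSubsetStrategy = tree.FeatureSubsetStrategy
      val FeatureSubsetStrategy = tree.FeatureSubsetStrategy
    }
    
    object SharedStrategyDemo {
      def main(args: Array[String]): Unit = {
        val fromClassification: classification.FeatureSubsetStrategy =
          classification.FeatureSubsetStrategy.Fraction(0.2)
        val fromRegression: regression.FeatureSubsetStrategy =
          regression.FeatureSubsetStrategy.Fraction(0.2)
        // Both aliases point at the same underlying type, so the values are equal.
        println(fromClassification == fromRegression)
      }
    }
    ```
    
    With aliases like these there is still a single definition, and a value created through one namespace can be passed to code that names the type through the other.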
    
    CC: @mengxr  @manishamde  @codedeft  @chouqin 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5009.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5009
    
----
commit a335cb7f77d1a1cd2c40ee1cc7fdfff36afe7081
Author: Joseph K. Bradley <[email protected]>
Date:   2015-03-13T07:30:46Z

    Initial sketch of new tree API, especially parameters

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

