GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/5530

    [SPARK-6113] [ml] Stabilize DecisionTree API

    This is a PR for cleaning up and finalizing the DecisionTree API.  PRs for 
ensembles will follow once this is merged.
    
    ### Goal
    
    Here is the description copied from the JIRA (for both trees and ensembles):
    
    > **Issue**: The APIs for DecisionTree and ensembles (RandomForests and 
GradientBoostedTrees) have been experimental for a long time. The API has 
become very convoluted because trees and ensembles have many, many variants, 
some of which we have added incrementally without a long-term design.
    > **Proposal**: This JIRA is for discussing changes required to finalize 
the APIs. After we discuss, I will make a PR to update the APIs and make them 
non-Experimental. This will require making many breaking changes; see the 
design doc for details.
    > **[Design doc](https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4)**: This outlines current issues and the proposed API.
    
    Overall code layout:
    * The old API in mllib.tree.* will remain the same.
    * The new API will reside in ml.classification.* and ml.regression.*
    
    ### Summary of changes
    
    Old API
    * Unchanged, except that 1 method in Loss is now private. This is not a breaking change, since that method was introduced after the Spark 1.3 release.
    
    New APIs
    * Under Pipeline API
    * The new API preserves functionality, except:
      * The new API does NOT store prob (the probability of the predicted label in classification). I want it to store the full vector of probabilities, but I feel that belongs in a later PR.
    * Use abstractions for parameters, estimators, and models to avoid code 
duplication
    * Limit parameters to relevant algorithms
    * For enum-like types, only expose Strings
      * We can make these pluggable later on by adding new parameters.  That is 
a far-future item.
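
    As an illustration of the "only expose Strings" point, here is a minimal sketch. It is not the code in this PR: `ImpuritySketch`, `Impurity`, `Gini`, `Entropy`, `supportedImpurities`, and `fromString` are hypothetical names chosen for the example. The idea is that users pass a String, and the conversion to an internal enum-like type happens in one place.

    ```scala
    // Hypothetical sketch, not the code in this PR.
    object ImpuritySketch {
      // Enum-like type used internally; users never construct it directly.
      sealed trait Impurity
      case object Gini extends Impurity
      case object Entropy extends Impurity

      // The String values a user is allowed to pass.
      val supportedImpurities: Seq[String] = Seq("entropy", "gini")

      // Convert the user-facing String (case-insensitive) to the internal type,
      // rejecting anything that is not supported.
      def fromString(name: String): Impurity =
        name.toLowerCase(java.util.Locale.ROOT) match {
          case "gini" => Gini
          case "entropy" => Entropy
          case other =>
            throw new IllegalArgumentException(
              s"Unsupported impurity '$other'; supported values: " +
                supportedImpurities.mkString(", "))
        }
    }
    ```

    Making such a type pluggable later would then mean adding a new parameter next to the String one, without changing the String-based setter.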
    
    Test suites
    * I reorganized DecisionTreeSuite, but I made absolutely no changes to the tests themselves.
    * The test suites for the new API only test (a) similarity with the results 
of the old API and (b) elements of the new API.
      * After the implementation code is moved to the new API, we should also move over the tests in the old suites that test the internals.
    
    ### Details
    
    #### Changed names
    
    Parameters
    * useNodeIdCache -> cacheNodeIds
    
    #### Other changes
    
    * Split: Changed categories from a list to a set
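
    To show what storing the categories as a set looks like at a call site, here is a minimal sketch. `SplitSketch`, `CategoricalSplit`, `featureIndex`, and `shouldGoLeft` are hypothetical names for this example and are not taken from the PR; the assumption that listed categories go to the left child is also only for illustration.

    ```scala
    // Hypothetical sketch, not the Split class in this PR.
    object SplitSketch {
      // A categorical split: feature values listed in `categories` go left
      // (an assumption made for this example).
      final case class CategoricalSplit(featureIndex: Int, categories: Set[Double]) {
        // Membership test on a Set; duplicate categories cannot occur.
        def shouldGoLeft(features: Array[Double]): Boolean =
          categories.contains(features(featureIndex))
      }
    }
    ```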
    
    #### Non-decision tree changes
    * AttributeGroup
      * Added parentheses to the toMetadata and toStructField methods. (These were removed in a previous PR, but I ran into 1 issue: the Scala compiler could not disambiguate between a toMetadata method with no parentheses and a toMetadata method which takes 1 argument.)
    * Attributes
      * Renamed: toMetadata -> toMetadataImpl
      * Added toMetadata methods which return ML metadata (keyed with "ML_ATTR")
      * NominalAttribute: Added getNumValues method which examines both 
numValues and values.
    * Params.inheritValues: Checks whether the parent param really belongs to 
the child (to allow Estimator-Model pairs with different sets of parameters)
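
    The attribute changes above can be sketched roughly as follows. This is a simplified stand-in, not the code in this PR: `AttributeSketch`, `NominalAttr`, and the `Map`-based `Metadata` alias are hypothetical, and the field keys "type" and "num_vals" are illustrative. The sketch shows getNumValues consulting both numValues and values, toMetadataImpl producing the attribute's own fields, and both toMetadata overloads declared with a parameter list so that call sites always write `toMetadata()` or `toMetadata(existing)`. It does not reproduce the compiler issue itself.

    ```scala
    // Hypothetical sketch, not the Attribute classes in this PR.
    object AttributeSketch {
      // Stand-in for a metadata type; a plain Map keeps the sketch self-contained.
      type Metadata = Map[String, Any]

      // A nominal attribute where either field may be unset.
      final case class NominalAttr(
          numValues: Option[Int] = None,
          values: Option[Array[String]] = None) {

        // Prefer the explicit count; otherwise fall back to the number of listed values.
        def getNumValues: Option[Int] =
          numValues.orElse(values.map(_.length))

        // Attribute-specific fields only, without the "ML_ATTR" wrapper key.
        def toMetadataImpl: Metadata = {
          val base: Metadata = Map("type" -> "nominal")
          getNumValues.fold(base)(n => base + ("num_vals" -> n))
        }

        // Wrap the attribute fields under "ML_ATTR", keeping any existing entries.
        def toMetadata(existing: Metadata): Metadata =
          existing + ("ML_ATTR" -> toMetadataImpl)

        // Declared with an empty parameter list, like the one-argument overload.
        def toMetadata(): Metadata = toMetadata(Map.empty[String, Any])
      }
    }
    ```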
    
    ### Questions for reviewers
    
    * Is "DecisionTreeClassificationModel" too long a name?
    * Is this OK in the docs?
    ```
    class DecisionTreeRegressor
      extends TreeRegressor[DecisionTreeRegressionModel]
      with DecisionTreeParams[DecisionTreeRegressor]
      with TreeRegressorParams[DecisionTreeRegressor]
    ```
    
    ### Future
    
    We should open up the abstractions at some point.  E.g., it would be useful 
to be able to set tree-related parameters in 1 place and then pass those to 
multiple tree-based algorithms.
    
    Follow-up JIRAs will be (in this order):
    * Tree ensembles
    * Deprecate old tree code
    * Move DecisionTree implementation code to new API.
    * Move tests from the old suites which test the internals.
    * Update programming guide
    * Python API
    * Change RandomForest* to always use bootstrapping, even when numTrees = 1
    * Provide the probability of the predicted label for classification.  Once the code is moved to the new API and updated to maintain probabilities for all labels, we can add the probabilities to the new API.
    
    CC: @mengxr  @manishamde  @codedeft  @chouqin  @MechCoder

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-api-dt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5530.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5530
    
----
commit c72c1a01387bfd0e672f71f385f9a31e9d5d2c1e
Author: Joseph K. Bradley <[email protected]>
Date:   2015-04-01T03:55:22Z

    Copied changes for common items, plus DecisionTreeClassifier from original 
PR

commit 2532c9a9189cc01592a6f9d7d49d10055f918d7c
Author: Joseph K. Bradley <[email protected]>
Date:   2015-04-02T03:08:08Z

    partial move to spark.ml API, not done yet

commit f9fbb605f503a91f4a998cd32fee11510dfd341c
Author: Joseph K. Bradley <[email protected]>
Date:   2015-04-14T05:07:15Z

    Done with DecisionTreeClassifier, but no save/load yet.  Need to add 
example as well

commit 0bdc486e426abaef4c2c3619280450601a458ab8
Author: Joseph K. Bradley <[email protected]>
Date:   2015-04-14T06:28:13Z

    fixed issues after param PR was merged

commit 119f407231f52d9339dd8b821b6fe652e6b695b8
Author: Joseph K. Bradley <[email protected]>
Date:   2015-04-14T22:11:20Z

    added DecisionTreeClassifier example

commit e11673f8994314add5d2a749c1ab808f126d2bca
Author: Joseph K. Bradley <[email protected]>
Date:   2015-04-14T23:53:03Z

    Added DecisionTreeRegressor, test suites, and example

commit 7ef63ed593cbcaa87b0078b548c6c7738499d7b3
Author: Joseph K. Bradley <[email protected]>
Date:   2015-04-14T23:53:19Z

    Added DecisionTreeRegressor, test suites, and example (for real this time)

commit f8fbd24877c522138f8d16d2c1855c498b83ba0c
Author: Joseph K. Bradley <[email protected]>
Date:   2015-04-15T16:23:58Z

    imported reorg of DecisionTreeSuite from old PR.  small cleanups

----


