GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1582

    [mllib] Decision Tree API update and multiclass bug fix

    Summary:
     (1) Split DecisionTree API into separate Classifier and Regressor classes.
     (2) Bug fixes for recent multiclass PR
         (https://github.com/apache/spark/pull/886)
    
    Details on (1) API:
    
    (1a) Split classes, e.g.: DecisionTree --> DecisionTreeClassifier and
         DecisionTreeRegressor.
    (1b) Included print() function for human-readable model descriptions.
    (1c) Renamed Strategy to *Params.  Changed to take strings instead of
         special types.
    (1d) Made configuration classes (Impurity, QuantileStrategy) private to
         mllib.
    (1e) Shifted the meaning of maxDepth by 1 to match scikit-learn and rpart:
         maxDepth = 1 now means 1 internal node and 2 leaf nodes.
    (1f) Removed static train() functions in favor of using Params classes.
    (1g) Introduced DatasetInfo class for metadata.
    
    Details on (2) bug fixes:
    
    (2a) Inconsistent aggregate (agg) indexing for unordered features.
    (2b) Fixed gain calculations for edge cases.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark decisiontree-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1582.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1582
    
----
commit 929f0e648962fd0e0529ac2f40452c7302eed733
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-18T21:34:59Z

    updating DT API

commit 29e29b8c132fd08649d71807d60dc1eb369a6ea5
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-18T21:54:07Z

    Merging multiclass DT PR, plus others, into branch with updates to DT API.

commit 20fc8057e912c8cc1266cbb39ce0285907e7356b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-19T23:37:50Z

    Mostly done with DecisionTree API re-config.  Still need to update the
    DecisionTreeRegressor class and object, and to update docs, tests and
    examples.

commit 0ced13a5773e2973042a580a03ab4a9457fe3fe8
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-23T20:07:55Z

    Major changes to DecisionTree API and internals.  Unit tests work.  Still
    need to update documentation.
    
    Split classes:
    * DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor
    * DecisionTreeModel --> DecisionTreeClassifierModel,
      DecisionTreeRegressorModel
    * Super-classes DecisionTree, DecisionTreeModel are private to mllib.
    
    Included print() function for human-readable model descriptions
    * For: DecisionTreeClassifierModel, DecisionTreeRegressorModel, Node
    
    Parameters (formerly named Strategy):
    * Split into: DTParams, DTClassifierParams, DTRegressorParams.
    * Added defaultParams() method to DecisionTreeClassifier/Regressor.
    * impurity
    ** Made private to mllib package.
    ** Split Impurity into ClassifierImpurity, RegressorImpurity
    ** Added factories: ClassifierImpurities, RegressorImpurities
    * QuantileStrategy: Added factory QuantileStrategies
    * maxDepth: Changed meaning by 1.  Previously, depth = 1 meant 1 leaf node;
      now it means 1 internal and 2 leaf nodes.  This matches scikit-learn and
      rpart.
    
    train() functions:
    * Changed to use DatasetInfo class for metadata.
    * Eliminated many of the static train() functions to prevent users from
      needing to remember the order of long lists of parameters.
    
    DecisionTree internals:
    * Renamed numSplits to numBins (since it was a duplicate name).

commit 4ba347fa2bce4b714478680a10442d26e6972ffc
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-23T20:10:21Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-api

commit a853bfc1929e9d1fb56d955241c827fd2a5c1351
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-23T20:25:05Z

    Last non-merge commit said it changed the maxDepth meaning, but it did not.
    This one implements this change:
    
    maxDepth: Changed meaning by 1.  Previously, depth = 1 meant 1 leaf node;
    now it means 1 internal and 2 leaf nodes.  This matches scikit-learn and
    rpart.
    Internally, this meant replacing: maxDepth <- maxDepth + 1.
    In tests, decremented maxDepth by 1.
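
The relationship between the two maxDepth conventions can be sketched as follows. This is only an illustration in Python, not code from the PR; the function names are hypothetical, and the node counts assume a full binary tree.

```python
def old_to_new_depth(old_max_depth: int) -> int:
    """Convert a maxDepth value from the old convention to the new one.

    Old convention (per the commit message): depth = 1 meant a single leaf.
    New convention: depth = 1 means 1 internal node and 2 leaf nodes.
    """
    if old_max_depth < 1:
        raise ValueError("old-convention maxDepth must be at least 1")
    return old_max_depth - 1


def max_leaves(new_max_depth: int) -> int:
    """Largest possible number of leaves under the new convention."""
    if new_max_depth < 0:
        raise ValueError("maxDepth must be non-negative")
    return 2 ** new_max_depth


def max_nodes(new_max_depth: int) -> int:
    """Largest possible number of nodes (internal + leaf), new convention."""
    return 2 ** (new_max_depth + 1) - 1


# New depth 1: one internal node plus two leaves.
print(max_nodes(1), max_leaves(1))
# Old depth 1 corresponds to new depth 0: a single leaf.
print(max_nodes(old_to_new_depth(1)))
```

Under this reading, a test that previously passed maxDepth = d would pass d - 1 after the change, which is consistent with the note about decrementing maxDepth in tests.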

commit 45068442dbcf36548d32001d60f9d4bda68c6a87
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-24T01:06:36Z

    Changed all config/impurity classes/objects to be private[mllib].
    Changed Params classes to take strings instead of special types.
    Made impurity names lists publicly accessible via Params classes.
    Simplified impurity factories.
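
A string-keyed impurity factory of the kind described above might look roughly like the sketch below. It is written in Python for brevity and is not the PR's Scala code; the names CLASSIFIER_IMPURITIES and classifier_impurity are hypothetical, and only the Gini and Entropy impurities mentioned elsewhere in this message are included.

```python
import math


def gini(counts):
    """Gini impurity of a list of per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (base 2) of a list of per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    result = 0.0
    for c in counts:
        if c > 0:  # skip empty classes; 0 * log(0) is treated as 0
            p = c / total
            result -= p * math.log2(p)
    return result


# Hypothetical registry: the public list of names users may pass as strings.
CLASSIFIER_IMPURITIES = {"gini": gini, "entropy": entropy}


def classifier_impurity(name):
    """Look up an impurity function by (case-insensitive) name."""
    try:
        return CLASSIFIER_IMPURITIES[name.lower()]
    except KeyError:
        supported = ", ".join(sorted(CLASSIFIER_IMPURITIES))
        raise ValueError(f"unknown impurity {name!r}; supported: {supported}")


print(classifier_impurity("Gini")([5, 5]))
```

The point of the design is that callers only handle strings, so the impurity classes themselves can stay private to the package while the list of valid names remains visible.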

commit b6b0809249a81e950f87b0a7f2c389f6c5d08f98
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-24T01:07:26Z

    removed
    mllib/src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java
    since it fails currently

commit a2a93115a1f2106e13bb122589a28669310d19f5
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-24T01:07:26Z

    removed
    mllib/src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java
    since it fails currently
    
    Comments which should have been added to previous commit:
    
    Fixed one test in DecisionTreeSuite to undo a change in previous commit
    (“stump with categorical variables for multiclass classification”).
    Reverted impurity from Entropy back to Gini.
    
    Java compatibility:
    * Changed non-static train() methods’ names to run() to avoid conflicts
      with static train() methods in Java.
    * Added setter functions to *Params classes.

commit 0cb9866ab5a7e70663358263aea3ae1136c3b19b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-24T01:09:39Z

    Merge branch 'decisiontree-api' of github.com:jkbradley/spark into
    decisiontree-api

commit 3ff5027c8fc7cd3e5a84233ceb763dc905ec6cc0
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-24T22:31:04Z

    Bug fix: Indexing was inconsistent for aggregate calculations for unordered
    features (in multiclass classification with categorical features, where the
    features had few enough values that they could be considered unordered,
    i.e., isSpaceSufficientForAllCategoricalSplits=true).
    * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue,
      binIndex), where
    ** featureValue was from arr (so it was a feature value)
    ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
    * The rest of the code indexed agg as (node, feature, binIndex, label).
    * Corrected this bug by changing updateBinForUnorderedFeature to use the
      second indexing pattern.
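
To illustrate why mixing the two indexing patterns corrupts the aggregates, here is a small Python sketch. It assumes agg is a flat array laid out in row-major order over (node, feature, binIndex, label); that layout, the helper name agg_offset, and the sizes used are assumptions for illustration, not details taken from the PR.

```python
def agg_offset(node, feature, bin_index, label,
               num_features, num_bins, num_classes):
    """Flat offset for the (node, feature, binIndex, label) layout."""
    return ((node * num_features + feature) * num_bins
            + bin_index) * num_classes + label


# Hypothetical sizes: 1 feature, 3 bins, 3 classes.
NUM_FEATURES, NUM_BINS, NUM_CLASSES = 1, 3, 3

# One training point at node 0, feature 0.
node, feature = 0, 0
feature_value, bin_index, label = 2, 1, 0

# Where the rest of the code expects this point's count to live.
reader_offset = agg_offset(node, feature, bin_index, label,
                           NUM_FEATURES, NUM_BINS, NUM_CLASSES)

# Buggy writer: (node, feature, featureValue, binIndex) fed into the same
# layout, so featureValue lands in the bin slot and binIndex in the label slot.
buggy_writer_offset = agg_offset(node, feature, feature_value, bin_index,
                                 NUM_FEATURES, NUM_BINS, NUM_CLASSES)

# Fixed writer: same pattern as the reader.
fixed_writer_offset = agg_offset(node, feature, bin_index, label,
                                 NUM_FEATURES, NUM_BINS, NUM_CLASSES)

print(reader_offset, buggy_writer_offset, fixed_writer_offset)
```

With these values the buggy writer increments a different cell from the one the reader later consults, so the counts used for split selection are wrong even though no index goes out of bounds. That is why a test of training accuracy, rather than a crash, is what exposes the bug.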
    
    Unit tests in DecisionTreeSuite
    * Updated a few tests to train a model and test its training accuracy,
      which catches the indexing bug from updateBinForUnorderedFeature()
      discussed above.
    * Added new test (“stump with categorical variables for multiclass
      classification, with just enough bins”) to test bin extremes.
    
    Bug fix: calculateGainForSplit (for classification):
    * It used to return dummy prediction values when either the right or left
      child had 0 weight.  These were incorrect for multiclass classification.
      This has been corrected.
    
    Updated impurities to allow for count = 0.  This was related to the above
    bug fix for calculateGainForSplit (for classification).
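
The edge case can be sketched in Python as below. This is an illustrative reconstruction under stated assumptions, not the PR's implementation: it uses Gini impurity, represents each child by a list of per-class counts, and takes the prediction to be the majority class of the parent. The names gain_for_split and predict are hypothetical.

```python
def gini(counts):
    """Gini impurity; defined as 0 when there are no samples (count = 0)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def predict(counts):
    """Majority class index, computed from real counts rather than a dummy."""
    return max(range(len(counts)), key=lambda i: counts[i])


def gain_for_split(left_counts, right_counts):
    """Return (gain, prediction) for a candidate split of a node."""
    parent_counts = [l + r for l, r in zip(left_counts, right_counts)]
    total = sum(parent_counts)
    if total == 0:
        return 0.0, 0  # empty node: nothing to gain, arbitrary prediction
    left_weight = sum(left_counts) / total
    right_weight = sum(right_counts) / total
    gain = (gini(parent_counts)
            - left_weight * gini(left_counts)
            - right_weight * gini(right_counts))
    return gain, predict(parent_counts)


# Left child is empty: the gain is 0, but the prediction is still class 2.
print(gain_for_split([0, 0, 0], [2, 3, 5]))
```

Because the impurity tolerates a zero count, the empty-child case needs no special branch that returns a placeholder prediction; the prediction always comes from the actual class counts.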
    
    Small updates to documentation and coding style.

commit 3ba5b4c5692bc0e769a0fb76f382c32bf3db6292
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-25T00:17:07Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-api

----


