GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/1582
[mllib] Decision Tree API update and multiclass bug fix
Summary:
(1) Split DecisionTree API into separate Classifier and Regressor classes.
(2) Bug fixes for recent multiclass PR
(https://github.com/apache/spark/pull/886)
Details on (1) API:
(1a) Split classes: E.g.: DecisionTree --> DecisionTreeClassifier and
DecisionTreeRegressor
(1b) Included print() function for human-readable model descriptions
(1c) Renamed Strategy to *Params. Changed to take strings instead of
special types.
(1d) Made configuration classes (Impurity, QuantileStrategy) private to
mllib.
(1e) Changed meaning of maxDepth by 1 to match scikit-learn and rpart.
(1f) Removed static train() functions in favor of using Params classes.
(1g) Introduced DatasetInfo class for metadata.
Details on (2) bug fixes:
(2a) Inconsistent aggregate (agg) indexing for unordered features.
(2b) Fixed gain calculations for edge cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark decisiontree-api
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1582.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1582
----
commit 929f0e648962fd0e0529ac2f40452c7302eed733
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-18T21:34:59Z
updating DT API
commit 29e29b8c132fd08649d71807d60dc1eb369a6ea5
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-18T21:54:07Z
Merging multiclass DT PR, plus others, into branch with updates to DT API.
commit 20fc8057e912c8cc1266cbb39ce0285907e7356b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-19T23:37:50Z
Mostly done with DecisionTree API re-config. Still need to update the
DecisionTreeRegressor class and object, and update docs, tests and examples.
commit 0ced13a5773e2973042a580a03ab4a9457fe3fe8
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-23T20:07:55Z
Major changes to DecisionTree API and internals. Unit tests work. Still
need to update documentation.
Split classes:
* DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor
* DecisionTreeModel --> DecisionTreeClassifierModel,
DecisionTreeRegressorModel
* Super-classes DecisionTree, DecisionTreeModel are private to mllib.
Included print() function for human-readable model descriptions
* For: DecisionTreeClassifierModel, DecisionTreeRegressorModel, Node
Parameters (used to be named Strategy):
* Split into: DTParams, DTClassifierParams, DTRegressorParams.
* Added defaultParams() method to DecisionTreeClassifier/Regressor.
* impurity
** Made private to mllib package.
** Split Impurity into ClassifierImpurity, RegressorImpurity
** Added factories: ClassifierImpurities, RegressorImpurities
* QuantileStrategy: Added factory QuantileStrategies
* maxDepth: Changed meaning by 1. Previously, depth = 1 meant 1 leaf node;
now it means 1 internal and 2 leaf nodes. This matches scikit-learn and rpart.
train() functions:
* Changed to use DatasetInfo class for metadata.
* Eliminated many of the static train() functions to prevent users from
needing to remember the order of long lists of parameters.
DecisionTree internals:
* renamed numSplits to numBins (since it was a duplicate name)
commit 4ba347fa2bce4b714478680a10442d26e6972ffc
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-23T20:10:21Z
Merge remote-tracking branch 'upstream/master' into decisiontree-api
commit a853bfc1929e9d1fb56d955241c827fd2a5c1351
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-23T20:25:05Z
Last non-merge commit said it changed the maxDepth meaning, but it did not.
This one implements this change:
maxDepth: Changed meaning by 1. Previously, depth = 1 meant 1 leaf node;
now it means 1 internal and 2 leaf nodes. This matches scikit-learn and rpart.
Internally, this meant replacing: maxDepth <- maxDepth+1.
In tests, decremented maxDepth by 1.
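
To make the off-by-one concrete, here is a minimal Python sketch (not code from this PR; the function names are illustrative assumptions) that counts the nodes of a full binary tree under each maxDepth convention:

```python
def max_nodes_old(max_depth):
    """Old convention: maxDepth = 1 means a single leaf (the root only)."""
    return 2 ** max_depth - 1


def max_nodes_new(max_depth):
    """New convention: maxDepth = 1 means 1 internal node and 2 leaves."""
    return 2 ** (max_depth + 1) - 1


# The new meaning of maxDepth = d matches the old meaning of maxDepth = d + 1,
# which is why the internal change amounts to maxDepth <- maxDepth + 1.
for d in range(0, 5):
    assert max_nodes_new(d) == max_nodes_old(d + 1)

print(max_nodes_old(1), max_nodes_new(1))  # 1 3
```

Under this reading, tests that previously passed maxDepth = d need maxDepth = d - 1 to build the same tree.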
commit 45068442dbcf36548d32001d60f9d4bda68c6a87
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-24T01:06:36Z
Changed all config/impurity classes/objects to be private[mllib].
Changed Params classes to take strings instead of special types.
Made impurity names lists publicly accessible via Params classes.
Simplified impurity factories.
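
As a rough illustration of configuring by string rather than by special type, the following Python sketch shows a name-based impurity lookup. It is not the mllib code; the registry, function names, and error message are assumptions made for the example, and only the Gini and Entropy impurity names come from this PR.

```python
import math


def _gini(counts):
    """Gini impurity from per-class counts (assumes at least one instance)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def _entropy(counts):
    """Entropy in bits from per-class counts (assumes at least one instance)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical registry; the real factories live in private mllib classes.
_CLASSIFIER_IMPURITIES = {"gini": _gini, "entropy": _entropy}


def supported_classifier_impurities():
    """Publicly accessible list of accepted impurity names."""
    return sorted(_CLASSIFIER_IMPURITIES)


def classifier_impurity(name):
    """Look up an impurity function by its (case-insensitive) name."""
    try:
        return _CLASSIFIER_IMPURITIES[name.lower()]
    except KeyError:
        raise ValueError(
            "Unknown impurity %r; supported: %s"
            % (name, ", ".join(supported_classifier_impurities()))
        )


print(supported_classifier_impurities())  # ['entropy', 'gini']
```

The trade-off in this kind of design is that typos in names surface at run time rather than compile time, so a clear error listing the supported names helps.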
commit b6b0809249a81e950f87b0a7f2c389f6c5d08f98
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-24T01:07:26Z
removed
mllib/src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java
since it fails currently
commit a2a93115a1f2106e13bb122589a28669310d19f5
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-24T01:07:26Z
removed
mllib/src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java
since it fails currently
Comments which should have been added to previous commit:
Fixed one test in DecisionTreeSuite to undo a change in previous commit
("stump with categorical variables for multiclass classification").
Reverted impurity from Entropy back to Gini.
Java compatibility:
* Changed non-static train() methods' names to run() to avoid conflicts
with static train() methods in Java.
* Added setter functions to *Params classes.
commit 0cb9866ab5a7e70663358263aea3ae1136c3b19b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-24T01:09:39Z
Merge branch 'decisiontree-api' of github.com:jkbradley/spark into
decisiontree-api
commit 3ff5027c8fc7cd3e5a84233ceb763dc905ec6cc0
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-24T22:31:04Z
Bug fix: Indexing was inconsistent for aggregate calculations for unordered
features (in multiclass classification with categorical features, where the
features had few enough values such that they could be considered unordered,
i.e., isSpaceSufficientForAllCategoricalSplits=true).
* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue,
binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0, ..., 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).
* Corrected this bug by changing updateBinForUnorderedFeature to use the
second indexing pattern.
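
The general hazard can be sketched in Python as follows. This is not the mllib implementation: flat_index, the array sizes, and the swapped read are hypothetical, chosen only to show that a flat aggregate array gives wrong answers when the writer and the reader disagree on the index ordering.

```python
def flat_index(indices, dims):
    """Row-major offset of a multi-index into a flat array with shape dims."""
    offset = 0
    for i, d in zip(indices, dims):
        assert 0 <= i < d, "index out of range"
        offset = offset * d + i
    return offset


# Hypothetical sizes: 1 node, 1 feature, 3 bins, 3 labels.
num_nodes, num_features, num_bins, num_labels = 1, 1, 3, 3

# Writer and reader must both use (node, feature, binIndex, label).
dims = (num_nodes, num_features, num_bins, num_labels)
agg = [0.0] * (num_nodes * num_features * num_bins * num_labels)

# Record one instance with label 2 that falls in bin 1.
agg[flat_index((0, 0, 1, 2), dims)] += 1.0

# Reading with the same ordering finds the count...
assert agg[flat_index((0, 0, 1, 2), dims)] == 1.0
# ...while a reader that uses a different ordering looks in another slot.
assert agg[flat_index((0, 0, 2, 1), dims)] == 0.0
```

Because such a mismatch does not raise an error by itself, it tends to show up only as degraded accuracy, which is consistent with catching it through accuracy checks on a trained model.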
Unit tests in DecisionTreeSuite
* Updated a few tests to train a model and test its training accuracy,
which catches the indexing bug from updateBinForUnorderedFeature() discussed
above.
* Added new test ("stump with categorical variables for multiclass
classification, with just enough bins") to test bin extremes.
Bug fix: calculateGainForSplit (for classification):
* It used to return dummy prediction values when either the right or left
child had 0 weight. These were incorrect for multiclass classification. This
has been corrected.
Updated impurities to allow for count = 0. This was related to the above
bug fix for calculateGainForSplit (for classification).
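
A minimal Python sketch of the idea, under stated assumptions: the function names are illustrative, Gini is used as the example impurity, and treating an empty node as having impurity 0.0 is a choice made here for the sketch rather than a statement of what calculateGainForSplit does.

```python
def gini(counts):
    """Gini impurity of per-class counts; returns 0.0 for an empty node."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: an empty child contributes no impurity
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gain_for_split(left_counts, right_counts):
    """Impurity reduction from splitting a parent into left and right."""
    parent = [l + r for l, r in zip(left_counts, right_counts)]
    total = sum(parent)
    if total == 0:
        return 0.0
    left_weight = sum(left_counts) / total
    right_weight = sum(right_counts) / total
    return (gini(parent)
            - left_weight * gini(left_counts)
            - right_weight * gini(right_counts))


# A split that sends every instance left has zero gain, and the empty
# right child is handled without dividing by zero.
print(gain_for_split([3, 2, 1], [0, 0, 0]))  # 0.0
```

The same functions work for any number of classes, since they operate on the full per-class count vectors rather than on a binary label.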
Small updates to documentation and coding style.
commit 3ba5b4c5692bc0e769a0fb76f382c32bf3db6292
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-25T00:17:07Z
Merge remote-tracking branch 'upstream/master' into decisiontree-api
----