GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/1727
[SPARK-2478] [mllib] DecisionTree Python API
Added experimental Python API for Decision Trees.
API:
* class DecisionTreeModel
** predict() for single examples and RDDs, taking both feature vectors and
LabeledPoints
** numNodes()
** depth()
** __str__()
* class DecisionTree
** trainClassifier()
** trainRegressor()
** train()
Examples and testing:
* Added example testing classification and regression with batch
prediction: examples/src/main/python/mllib/tree.py
* Have also tested example usage in doc of python/pyspark/mllib/tree.py
which tests single-example prediction with dense and sparse vectors
Also: Small bug fix in python/pyspark/mllib/_common.py: In
_linear_predictor_typecheck, changed check for RDD to use isinstance() instead
of type() in order to catch RDD subclasses.
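The difference between the two checks can be seen in a minimal sketch. The classes below are hypothetical stand-ins for illustration, not PySpark's real RDD types, and the helper names are not taken from _common.py:

```python
class RDD(object):
    """Hypothetical stand-in for an RDD base class."""


class DerivedRDD(RDD):
    """Hypothetical subclass, standing in for any RDD subclass."""


def is_rdd_by_type(x):
    # Exact type comparison: a subclass instance does not match.
    return type(x) == RDD


def is_rdd_by_isinstance(x):
    # isinstance() follows the inheritance chain, so subclasses match.
    return isinstance(x, RDD)


derived = DerivedRDD()
print(is_rdd_by_type(derived))        # False
print(is_rdd_by_isinstance(derived))  # True
```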
CC @mengxr @manishamde
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark decisiontree-python-new
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1727.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1727
----
commit f8253520045d90c75b143d810edbb746f86cad8c
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-30T21:48:41Z
Wrote Python API and example for DecisionTree. Also added toString, depth,
and numNodes methods to DecisionTreeModel.
commit 5f920a10b6114baa0744f55843969843b1f2babc
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-30T22:24:55Z
Demonstration of bug before submitting fix: Updated DecisionTreeSuite so
that 3 tests fail. Will describe bug in next commit.
commit 73fbea2b2a921111cf22f4d9c76ea23c6a4f7afe
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-30T22:52:22Z
Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
commit 2283df878178d3b8c86ecde1d4220076af25b72f
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-30T22:53:14Z
2 bug fixes.
Indexing was inconsistent for aggregate calculations for unordered features
(in multiclass classification with categorical features, where the features had
few enough values such that they could be considered unordered, i.e.,
isSpaceSufficientForAllCategoricalSplits=true).
* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue,
binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0, ..., 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).
* Corrected this bug by changing updateBinForUnorderedFeature to use the
second indexing pattern.
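As a rough illustration of why every reader and writer must agree on one layout, here is a sketch of the (node, feature, binIndex, label) pattern. It assumes the aggregates live in a flat, row-major array; the function name and dimension parameters are hypothetical and not taken from the Scala code:

```python
def agg_index(node, feature, bin_index, label,
              num_features, num_bins, num_classes):
    # Row-major offset into a flat array laid out as
    # (node, feature, binIndex, label). Hypothetical helper.
    return ((node * num_features + feature) * num_bins + bin_index) \
        * num_classes + label


# With 2 features, 3 bins and 4 classes, each (node, feature, bin, label)
# tuple maps to its own slot, so updates and lookups never collide.
slots = set()
for node in range(2):
    for feature in range(2):
        for b in range(3):
            for label in range(4):
                slots.add(agg_index(node, feature, b, label, 2, 3, 4))
print(len(slots) == 2 * 2 * 3 * 4)
```

If one method computed offsets with a different tuple order, its writes would land in slots that the rest of the code interprets as a different (bin, label) pair.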
Unit tests in DecisionTreeSuite
* Updated a few tests to train a model and test its training accuracy,
which catches the indexing bug from updateBinForUnorderedFeature() discussed
above.
* Added new test ("stump with categorical variables for multiclass
classification, with just enough bins") to test bin extremes.
Bug fix: calculateGainForSplit (for classification):
* It used to return dummy prediction values when either the right or left
children had 0 weight. These were incorrect for multiclass classification. It
has been corrected.
Updated impurities to allow for count = 0. This was related to the above
bug fix for calculateGainForSplit (for classification).
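To illustrate the count = 0 case, here is a minimal sketch of a Gini impurity that tolerates an empty child node. Returning 0.0 for an empty node is an assumption made for this example, and the function is not the MLlib implementation:

```python
def gini(counts):
    """Gini impurity for a list of per-class counts."""
    total = float(sum(counts))
    if total == 0:
        # Empty node: avoid dividing by zero (assumed convention).
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


print(gini([0, 0, 0]))  # empty child of a split
print(gini([5, 5]))     # evenly mixed binary node
```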
Small updates to documentation and coding style.
commit 5fe44ed10450a3fbe407f5326da7391569003a78
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-30T23:07:46Z
Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
commit 8a758dbb18edf6efe8521598ab8da41736908841
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-30T23:08:48Z
Merge branch 'decisiontree-bugfix' into decisiontree-python-new
commit 8ea8750cd5eeefa87d937ca4214a5f548dd2e6a4
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T00:05:49Z
Bug fix: Off-by-1 when finding thresholds for splits for continuous
features.
* Exhibited bug in new test in DecisionTreeSuite: "stump with 1
continuous variable for binary classification, to check off-by-1 error"
* Description: When finding thresholds for possible splits for continuous
features in DecisionTree.findSplitsBins, the thresholds were set according to
individual training examples' feature values. This can cause problems for
small datasets, when the number of training examples equals numBins.
* Fix: The threshold is set to be the average of 2 consecutive (sorted)
examples' feature values. E.g.: If the old code set the threshold using
example i, the new code sets the threshold using examples i and i+1.
* Note: In 4 DecisionTreeSuite tests with all labels identical, removed
check of threshold since it is somewhat arbitrary.
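The change in threshold selection can be sketched as follows. The function names are hypothetical, and the sketch leaves out how findSplitsBins chooses the index i:

```python
def old_threshold(sorted_values, i):
    # Old behaviour: the threshold is exactly one example's feature value.
    return sorted_values[i]


def new_threshold(sorted_values, i):
    # New behaviour: average of two consecutive sorted feature values.
    # The caller must ensure i + 1 < len(sorted_values).
    return (sorted_values[i] + sorted_values[i + 1]) / 2.0


values = sorted([4.0, 1.0, 3.0, 2.0])
print(old_threshold(values, 1))  # 2.0
print(new_threshold(values, 1))  # 2.5
```

With the midpoint, the threshold falls strictly between examples i and i+1, so it separates them regardless of whether the comparison is strict or non-strict.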
commit cd1d933a3d686107a7a8272b7138b701a820a877
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T00:06:39Z
Merge branch 'decisiontree-bugfix' into decisiontree-python-new
commit 8e227ea826d6b38dc47e9a90ccf6683348c6dab0
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T00:18:55Z
Changed Strategy so it only requires numClassesForClassification >= 2 for
classification
commit da50db749f54a63565440d6c42f78373f1f2a2ac
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T00:32:10Z
Added one more test to DecisionTreeSuite: stump with 2 continuous variables
for binary classification. Caused problems in past, but fixed now.
commit f5a036c4eff3499f5456c441572ffb11514385c9
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T00:33:28Z
Merge branch 'decisiontree-bugfix' into decisiontree-python-new
commit 52e17c5b249afa10eb151e73ca36a72b4e6adbe8
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T16:24:21Z
Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
commit 59750f87c974299720ec556908c7e29b131d3476
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T18:08:46Z
* Updated Strategy to check numClassesForClassification only if
algo=Classification.
* Updates based on comments:
** DecisionTreeRunner
*** Made dataFormat arg default to libsvm
** Small cleanups
** tree.Node: Made recursive helper methods private, and renamed them.
commit bab3f190c51a8feced2bdb7d146072fcfa8cab72
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T18:10:55Z
Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
commit e06e423d7b046ae7e38001325ad7330a15179472
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T18:11:27Z
Merge branch 'decisiontree-bugfix' into decisiontree-python-new
commit 376dca2c848739b1536e6ee8ddbc55043d1eef7a
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T18:27:18Z
Updated meaning of maxDepth by 1 to fit scikit-learn and rpart.
* In code, replaced usages of maxDepth <-- maxDepth + 1
* In params, replace settings of maxDepth <-- maxDepth - 1
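Under the scikit-learn/rpart convention, a tree of depth 0 is a single leaf and a tree of depth 1 is one internal node with two leaves. A small sketch of the resulting bound on tree size for a binary tree (the helper name is hypothetical):

```python
def max_num_nodes(max_depth):
    # A full binary tree of depth d has 2^(d+1) - 1 nodes,
    # counting the root as depth 0.
    return 2 ** (max_depth + 1) - 1


print(max_num_nodes(0))  # 1: a single leaf
print(max_num_nodes(1))  # 3: one internal node plus two leaves
```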
commit 6eed4822759377b241c8dd0adadf32102e01d472
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T18:39:00Z
In DecisionTree: Changed from using procedural syntax for functions
returning Unit to explicitly writing Unit return type.
commit 978cfcf84cb0259c7f65738fd3ed70f78928951e
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T18:40:43Z
Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
commit 8bb8aa06a4033277ddd117445783678af4ff3dfd
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T20:02:10Z
Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
commit dab0b674b93c7ada8e9d8ac1fc364c0c9438785b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T20:08:46Z
Added documentation for DecisionTree internals
commit 584449a23f4ce5705fad6d0e5e2bc9f55034bbe5
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T20:09:53Z
Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
commit 1b29c13d829aae78812b03835f309ae37e8d4084
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T20:10:02Z
Merge branch 'decisiontree-bugfix' into decisiontree-python-new
commit 2b20c6151bab8a2ee218b851f40d54133f9807a2
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-31T20:39:43Z
Small doc and style updates
commit b8fac571dc4baa58b4c4c1473bb2969553270865
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T01:56:37Z
Finished Python DecisionTree API and example but need to test a bit more.
commit 66222477e4f9cb8c3ce1877312efa501c11bcf84
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T01:56:45Z
Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
commit 188cb0d05f5002ddacf3363b3ca79c41584e69d2
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T01:56:53Z
Merge branch 'decisiontree-bugfix' into decisiontree-python-new
commit 665ba7822bde3cb8105efb31d22e0084265c92da
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T16:42:22Z
Small updates towards Python DecisionTree API
commit 4562c08b5f08382f2e382d81f84c161966dc8315
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T16:42:57Z
Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
Conflicts:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
(no real conflict; merged by concatenating)
commit 6df89a9f1130430367b6c7f0daa23e1cdfdc9839
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T20:18:20Z
Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
commit 93953f16e16e4605cbfe8a9e3a26b372e69707ae
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T21:34:54Z
Likely done with Python API.
----