GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1727

    [SPARK-2478] [mllib] DecisionTree Python API

    Added experimental Python API for Decision Trees.
    
    API:
    * class DecisionTreeModel
    ** predict() for single examples and RDDs, taking both feature vectors
       and LabeledPoints
    ** numNodes()
    ** depth()
    ** __str__()
    * class DecisionTree
    ** trainClassifier()
    ** trainRegressor()
    ** train()
    
    Examples and testing:
    * Added example testing classification and regression with batch
      prediction: examples/src/main/python/mllib/tree.py
    * Also tested the example usage in the doc of
      python/pyspark/mllib/tree.py, which tests single-example prediction
      with dense and sparse vectors
    
    Also: Small bug fix in python/pyspark/mllib/_common.py: In 
_linear_predictor_typecheck, changed check for RDD to use isinstance() instead 
of type() in order to catch RDD subclasses.
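
The difference between the two checks can be illustrated with a minimal
sketch. The classes below are stand-ins defined only so the example runs
without Spark, and the helper names is_rdd_exact and is_rdd are hypothetical,
not the names used in _common.py:

```python
class RDD(object):
    """Stand-in for pyspark's RDD class (assumption, for illustration)."""


class PipelinedRDD(RDD):
    """Stand-in for an RDD subclass, e.g. the result of a transformation."""


def is_rdd_exact(x):
    # Exact type comparison: matches RDD itself but not its subclasses.
    return type(x) == RDD


def is_rdd(x):
    # isinstance() accepts RDD and any subclass of RDD.
    return isinstance(x, RDD)


mapped = PipelinedRDD()
print(is_rdd_exact(mapped))  # False
print(is_rdd(mapped))        # True
```

With the exact type comparison, an instance of a subclass falls through to
the non-RDD code path, which is the case the isinstance() check catches.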
    
    CC @mengxr @manishamde

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark decisiontree-python-new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1727.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1727
    
----
commit f8253520045d90c75b143d810edbb746f86cad8c
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-30T21:48:41Z

    Wrote Python API and example for DecisionTree.  Also added toString, depth, 
and numNodes methods to DecisionTreeModel.

commit 5f920a10b6114baa0744f55843969843b1f2babc
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-30T22:24:55Z

    Demonstration of bug before submitting fix: Updated DecisionTreeSuite so 
that 3 tests fail.  Will describe bug in next commit.

commit 73fbea2b2a921111cf22f4d9c76ea23c6a4f7afe
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-30T22:52:22Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix

commit 2283df878178d3b8c86ecde1d4220076af25b72f
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-30T22:53:14Z

    2 bug fixes.
    
    Indexing was inconsistent for aggregate calculations for unordered features
    (in multiclass classification with categorical features, where the features
    had few enough values that they could be considered unordered, i.e.,
    isSpaceSufficientForAllCategoricalSplits=true).
    
    * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, 
binIndex), where
    ** featureValue was from arr (so it was a feature value)
    ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
    * The rest of the code indexed agg as (node, feature, binIndex, label).
    * Corrected this bug by changing updateBinForUnorderedFeature to use the 
second indexing pattern.
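
To see why a mismatched index order corrupts the aggregates, here is a rough
sketch. It assumes a row-major flat array over (node, feature, binIndex,
label); the actual layout in DecisionTree may differ, and agg_index and
agg_coords are hypothetical helpers:

```python
def agg_index(node, feature, bin_index, label,
              num_features, num_bins, num_classes):
    """Flat offset into a row-major (node, feature, bin, label) array."""
    return (((node * num_features + feature) * num_bins + bin_index)
            * num_classes + label)


def agg_coords(offset, num_features, num_bins, num_classes):
    """Inverse of agg_index: recover (node, feature, bin, label)."""
    offset, label = divmod(offset, num_classes)
    offset, bin_index = divmod(offset, num_bins)
    node, feature = divmod(offset, num_features)
    return node, feature, bin_index, label


# Illustrative sizes only.
dims = dict(num_features=2, num_bins=3, num_classes=3)

# A writer that passes (featureValue=1, binIndex=2) in the last two slots...
written = agg_index(0, 1, 1, 2, **dims)
# ...is read back by the rest of the code as (binIndex=1, label=2).
print(agg_coords(written, **dims))
```

In this sketch the update lands in the cell for bin 1 and label 2, so the
count is credited to the wrong bin and to a label unrelated to the example.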
    
    Unit tests in DecisionTreeSuite
    * Updated a few tests to train a model and test its training accuracy, 
which catches the indexing bug from updateBinForUnorderedFeature() discussed 
above.
    * Added new test (“stump with categorical variables for multiclass 
classification, with just enough bins”) to test bin extremes.
    
    Bug fix: calculateGainForSplit (for classification):
    * It used to return dummy prediction values when either the right or left
      child had 0 weight.  These were incorrect for multiclass classification.
      It has been corrected.
    
    Updated impurities to allow for count = 0.  This was related to the above 
bug fix for calculateGainForSplit (for classification).
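
As an illustration of an impurity that tolerates count = 0, here is a minimal
Gini sketch. Returning 0.0 for an empty node is an assumption of this example,
and the function is not the Scala implementation from the commit:

```python
def gini(class_counts, total_count):
    """Gini impurity from per-class counts."""
    if total_count == 0:
        # Assumption: an empty node is treated as pure, which also avoids
        # dividing by zero below.
        return 0.0
    return 1.0 - sum((c / float(total_count)) ** 2 for c in class_counts)


print(gini([0, 0, 0], 0))  # 0.0
print(gini([5, 5], 10))    # 0.5
```

Without the guard, a child with zero weight would raise a division error (or
yield NaN in other languages) when its impurity is computed.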
    
    Small updates to documentation and coding style.

commit 5fe44ed10450a3fbe407f5326da7391569003a78
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-30T23:07:46Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit 8a758dbb18edf6efe8521598ab8da41736908841
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-30T23:08:48Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 8ea8750cd5eeefa87d937ca4214a5f548dd2e6a4
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T00:05:49Z

    Bug fix: Off-by-1 when finding thresholds for splits for continuous 
features.
    
    * Exhibited bug in new test in DecisionTreeSuite: “stump with 1 
continuous variable for binary classification, to check off-by-1 error”
    
    * Description: When finding thresholds for possible splits for continuous 
features in DecisionTree.findSplitsBins, the thresholds were set according to 
individual training examples’ feature values.  This can cause problems for 
small datasets, when the number of training examples equals numBins.
    
    * Fix: The threshold is set to be the average of 2 consecutive (sorted) 
examples’ feature values.  E.g.: If the old code set the threshold using 
example i, the new code sets the threshold using examples i and i+1.
    
    * Note: In 4 DecisionTreeSuite tests with all labels identical, removed 
check of threshold since it is somewhat arbitrary.
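
The midpoint rule described in the fix can be sketched as follows. This is
only an illustration: candidate_thresholds is a hypothetical helper, and the
way it picks which examples to split between is an assumption, not the logic
of DecisionTree.findSplitsBins:

```python
def candidate_thresholds(values, num_splits):
    """Midpoint thresholds between consecutive sorted feature values."""
    s = sorted(values)
    n = len(s)
    if n < num_splits + 1:
        raise ValueError("need at least num_splits + 1 examples")
    thresholds = []
    for k in range(1, num_splits + 1):
        # Index of the example just left of the k-th split (evenly spaced).
        i = (k * n) // (num_splits + 1) - 1
        # Average examples i and i+1 rather than using example i alone.
        thresholds.append((s[i] + s[i + 1]) / 2.0)
    return thresholds


print(candidate_thresholds([4.0, 1.0, 3.0, 2.0], 3))  # [1.5, 2.5, 3.5]
```

Because each threshold lies strictly between two distinct observed values, no
training example sits exactly on a split boundary in this sketch.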

commit cd1d933a3d686107a7a8272b7138b701a820a877
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T00:06:39Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 8e227ea826d6b38dc47e9a90ccf6683348c6dab0
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T00:18:55Z

    Changed Strategy so it only requires numClassesForClassification >= 2 for 
classification

commit da50db749f54a63565440d6c42f78373f1f2a2ac
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T00:32:10Z

    Added one more test to DecisionTreeSuite: stump with 2 continuous variables
    for binary classification.  This case caused problems in the past but is
    fixed now.

commit f5a036c4eff3499f5456c441572ffb11514385c9
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T00:33:28Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 52e17c5b249afa10eb151e73ca36a72b4e6adbe8
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T16:24:21Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix

commit 59750f87c974299720ec556908c7e29b131d3476
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T18:08:46Z

    * Updated Strategy to check numClassesForClassification only if 
algo=Classification.
    * Updates based on comments:
    ** DecisionTreeRunner
    *** Made dataFormat arg default to libsvm
    ** Small cleanups
    ** tree.Node: Made recursive helper methods private, and renamed them.

commit bab3f190c51a8feced2bdb7d146072fcfa8cab72
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T18:10:55Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit e06e423d7b046ae7e38001325ad7330a15179472
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T18:11:27Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 376dca2c848739b1536e6ee8ddbc55043d1eef7a
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T18:27:18Z

    Shifted the meaning of maxDepth by 1 to match scikit-learn and rpart.
    * In code, replaced usages of maxDepth <-- maxDepth + 1
    * In params, replaced settings of maxDepth <-- maxDepth - 1
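
One way to read the shifted convention, assuming (as in scikit-learn) that a
depth of 0 denotes a tree consisting of a single leaf node. The helper names
below are hypothetical and are not part of Spark:

```python
def to_new_max_depth(old_max_depth):
    # Per the commit message: parameter settings shift down by one.
    return old_max_depth - 1


def max_nodes(max_depth):
    """Upper bound on nodes in a binary tree where depth 0 is a single leaf."""
    if max_depth < 0:
        raise ValueError("max_depth must be >= 0")
    return 2 ** (max_depth + 1) - 1


print(max_nodes(0))                     # 1
print(max_nodes(to_new_max_depth(3)))   # 7
```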

commit 6eed4822759377b241c8dd0adadf32102e01d472
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T18:39:00Z

    In DecisionTree: Changed from using procedural syntax for functions 
returning Unit to explicitly writing Unit return type.

commit 978cfcf84cb0259c7f65738fd3ed70f78928951e
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T18:40:43Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix

commit 8bb8aa06a4033277ddd117445783678af4ff3dfd
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T20:02:10Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix

commit dab0b674b93c7ada8e9d8ac1fc364c0c9438785b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T20:08:46Z

    Added documentation for DecisionTree internals

commit 584449a23f4ce5705fad6d0e5e2bc9f55034bbe5
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T20:09:53Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit 1b29c13d829aae78812b03835f309ae37e8d4084
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T20:10:02Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 2b20c6151bab8a2ee218b851f40d54133f9807a2
Author: Joseph K. Bradley <[email protected]>
Date:   2014-07-31T20:39:43Z

    Small doc and style updates

commit b8fac571dc4baa58b4c4c1473bb2969553270865
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-01T01:56:37Z

    Finished Python DecisionTree API and example but need to test a bit more.

commit 66222477e4f9cb8c3ce1877312efa501c11bcf84
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-01T01:56:45Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit 188cb0d05f5002ddacf3363b3ca79c41584e69d2
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-01T01:56:53Z

    Merge branch 'decisiontree-bugfix' into decisiontree-python-new

commit 665ba7822bde3cb8105efb31d22e0084265c92da
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-01T16:42:22Z

    Small updates towards Python DecisionTree API

commit 4562c08b5f08382f2e382d81f84c161966dc8315
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-01T16:42:57Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
    (no real conflict; merged by concatenating)

commit 6df89a9f1130430367b6c7f0daa23e1cdfdc9839
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-01T20:18:20Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-python-new

commit 93953f16e16e4605cbfe8a9e3a26b372e69707ae
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-01T21:34:54Z

    Likely done with Python API.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
