GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/1740
[SPARK-2197] [mllib] Java DecisionTree bug fix and ease-of-use
Bug fix: Before, when an RDD was created in Java and passed to
DecisionTree.train(), the fake class tag caused problems.
* Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from
Java.
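
The kind of failure a placeholder class tag can cause may be easier to see in a plain-Java analogy. The sketch below does not use Spark; the class TagDemo and its methods are hypothetical names chosen for illustration. It assumes that the problem resembles generic code allocating an Object[] where a typed array is expected, which is roughly what carrying a real element type (as RDD.retag() is meant to do) avoids.

```java
import java.lang.reflect.Array;

public final class TagDemo {
    private TagDemo() {}

    // Analogy for a "fake" tag: with no real element type available,
    // the only array that can be allocated is an Object[].
    @SuppressWarnings("unchecked")
    static <T> T[] allocateWithErasedTag(int n) {
        return (T[]) new Object[n];
    }

    // Analogy for a "retagged" value: the real element class is supplied,
    // so the array has the correct runtime component type.
    @SuppressWarnings("unchecked")
    static <T> T[] allocateWithRealTag(Class<T> cls, int n) {
        return (T[]) Array.newInstance(cls, n);
    }

    // Returns true if using the erased allocation as a String[] fails.
    static boolean erasedAllocationFails() {
        try {
            // The compiler inserts a cast to String[] here; an Object[] fails it.
            String[] a = TagDemo.<String>allocateWithErasedTag(1);
            return a == null;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String[] ok = allocateWithRealTag(String.class, 1);
        System.out.println("real tag component type: "
            + ok.getClass().getComponentType().getSimpleName());
        System.out.println("erased allocation fails: " + erasedAllocationFails());
    }
}
```

This only illustrates the general erasure hazard; the exact failure seen in DecisionTree.train() is not described in this message.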
Other improvements to Decision Trees for ease-of-use with Java:
* impurity classes: Added instance() methods to help with Java interface.
* Strategy: Added Java-friendly constructor
** Note: I removed quantileCalculationStrategy from the Java-friendly
constructor since (a) it is a special class and (b) there is only 1 option
currently. I suspect we will redo the API before the other options are
included.
CC: @mengxr
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark dt-java-new
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1740.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1740
----
commit 225822fe38762596b8c917a867e5cdbb2d9b4b55
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T21:50:42Z
Bug: In DecisionTree, the method
sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins
from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is
the bound for unordered categorical features, not ordered ones. The upper bound
should be the arity (i.e., max value) of the feature.
Added new test to DecisionTreeSuite to catch this: "regression stump with
categorical variables of arity 2"
Bug fix: Modified upper bound discussed above.
Also: Small improvements to coding style in DecisionTree.
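
To make the two bounds in this commit message concrete, here is a small stand-alone Java sketch. It is not the DecisionTree code; the class BinBounds and its method names are hypothetical. It only evaluates the two expressions mentioned above: 2^(arity - 1) - 1 (described as the bound for unordered categorical features) and the arity itself (described as the correct bound for ordered ones).

```java
public final class BinBounds {
    private BinBounds() {}

    // Expression the commit message attributes to unordered categorical
    // features: 2^(arity - 1) - 1.
    static int unorderedBound(int arity) {
        checkArity(arity);
        return (1 << (arity - 1)) - 1;
    }

    // Bound the commit message says ordered categorical features should use.
    static int orderedBound(int arity) {
        checkArity(arity);
        return arity;
    }

    private static void checkArity(int arity) {
        // Keep the shift within int range for this illustration.
        if (arity < 1 || arity > 30) {
            throw new IllegalArgumentException("arity must be in [1, 30]: " + arity);
        }
    }

    public static void main(String[] args) {
        for (int arity = 2; arity <= 5; arity++) {
            System.out.printf("arity=%d unordered=%d ordered=%d%n",
                arity, unorderedBound(arity), orderedBound(arity));
        }
    }
}
```

For arity 2 the first expression evaluates to 1 while the arity is 2, and for arity 4 it evaluates to 7 while the arity is 4, so the two bounds disagree in both directions. That is consistent with the new test targeting categorical variables of arity 2.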
commit f1a8283c5cb6a497a9ac60c8ce1859dbe9a051b0
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T22:56:09Z
Added old JavaDecisionTreeSuite, to be updated later
commit 13a585e5b818735dfc6aa481547fc201ddfc1798
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-02T00:18:12Z
Merge remote-tracking branch 'upstream/master' into dt-java
commit 320853f464ca8658d7e28a9f39f288da33c88b23
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-02T00:40:53Z
Added JavaDecisionTreeSuite, partly written
commit d78ada636f490db6fb1e4a9f75af7f492c07f222
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-02T06:32:49Z
Merge remote-tracking branch 'upstream/master' into dt-java
commit f7b5ca1ed464de5d7d20f4c006621afa8d8b9e56
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-02T19:56:47Z
Improvements to make it easier to run DecisionTree from Java.
* DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java.
* impurity classes: Added instance() methods to help with Java interface.
* Strategy: Added Java-friendly constructor
** Note: I removed quantileCalculationStrategy from the Java-friendly
constructor since (a) it is a special class and (b) there is only 1 option
currently. I suspect we will redo the API before the other options are
included.
----