GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/1720
[SPARK-2796] [mllib] DecisionTree bug fix: ordered categorical features
Bug: In DecisionTree, the method
sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins
from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). That upper bound
applies to unordered categorical features, not ordered ones. For ordered
categorical features, the upper bound should be the arity (i.e., the number of
categories) of the feature.
Added new test to DecisionTreeSuite to catch this: "regression stump with
categorical variables of arity 2"
Bug fix: Modified upper bound discussed above.
Also: Small improvements to coding style in DecisionTree.
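
To make the difference between the two bounds concrete, here is a minimal
sketch that evaluates both for a few arities. It is not the code from
DecisionTree; the object name BinBounds and the helpers unorderedBound and
orderedBound are hypothetical names introduced only for illustration, and the
two formulas are taken from the description above.

```scala
// Illustrative sketch only; these helpers are hypothetical and do not
// appear in DecisionTree.scala.
object BinBounds {
  // Bound quoted in the bug description (appropriate for unordered
  // categorical features): math.pow(2, arity - 1) - 1.
  def unorderedBound(arity: Int): Int = math.pow(2, arity - 1).toInt - 1

  // Bound the description says ordered categorical features should use:
  // the arity itself.
  def orderedBound(arity: Int): Int = arity

  def main(args: Array[String]): Unit = {
    // Print both bounds side by side for small arities.
    for (arity <- 2 to 5) {
      println(s"arity=$arity unordered=${unorderedBound(arity)} " +
        s"ordered=${orderedBound(arity)}")
    }
  }
}
```

For arity 2 the unordered formula gives 1 while the arity is 2, so the old
bound was too small. The two values coincide at arity 3, and from arity 4
onward the unordered formula is larger than the arity. This may be why the new
test uses a feature of arity 2.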
CC @mengxr @manishamde
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark decisiontree-bugfix2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1720.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1720
----
commit 225822fe38762596b8c917a867e5cdbb2d9b4b55
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-01T21:50:42Z
----