GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/2785
[SPARK-3934] [SPARK-3918] [mllib] Bug fixes for RandomForest, DecisionTree
SPARK-3934: RandomForest fails on multiclass classification when the data
mixes unordered categorical features with continuous features. The bug is in
the sanity checks in getFeatureOffset and getLeftRightFeatureOffsets, which use
the wrong indices when checking whether a feature is unordered.
Fix: Remove the sanity checks. They are not really needed, and keeping them
would require DTStatsAggregator to keep track of an extra set of indices
(for the feature subset).
Added a test to RandomForestSuite that failed with the old version and now passes.
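
To illustrate the kind of index mix-up described for SPARK-3934, here is a
minimal sketch in plain Python. It is not Spark code. It assumes (this is an
assumption, not something stated above) that per-node statistics are addressed
by position within the node's feature subset, while the set of unordered
features is keyed by global feature index. All names are hypothetical.

```python
# Hypothetical illustration only; none of these names are Spark's API.

# Global indices of the unordered categorical features (assumed layout).
unordered_features = {0}

# Global feature indices sampled for one node. The position in this list
# is the "subset index" used to address that node's statistics.
feature_subset = [2, 0]


def is_unordered_wrong(subset_index):
    # Bug pattern: a subset position is treated as a global feature index.
    return subset_index in unordered_features


def is_unordered_right(subset_index):
    # Translate the subset position to a global index before the lookup.
    return feature_subset[subset_index] in unordered_features


for pos, global_idx in enumerate(feature_subset):
    print(pos, global_idx, is_unordered_wrong(pos), is_unordered_right(pos))
```

In this sketch the two checks disagree at both positions, so a sanity check
built on the wrong index would reject valid calls. A correct check needs the
subset-to-global mapping, which is the extra set of indices mentioned in the
fix.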
SPARK-3918: Added baggedInput.unpersist at the end of training.
Also:
* I removed DTStatsAggregator.isUnordered since it is no longer used.
* DecisionTreeMetadata: Added a logWarning for when maxBins is automatically
reduced.
* Updated DecisionTreeRunner to explicitly load the test data with the
same number of features as the training data. This is a temporary fix which
should eventually be replaced by pre-indexing both datasets.
* RandomForestModel: Updated toString to print the total number of nodes in
the forest.
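
The DecisionTreeRunner item above is about keeping the feature dimension of the
test data equal to that of the training data. The following plain-Python sketch
shows one way that can matter for sparse input. It assumes LIBSVM-style lines
("label index:value ...", 1-based indices), which is an assumption about the
input format, and the helper names are hypothetical rather than taken from
DecisionTreeRunner.

```python
# Hypothetical sketch; not the DecisionTreeRunner implementation.

def max_index(lines):
    # Largest 1-based feature index that appears in any line.
    return max(int(item.split(":")[0])
               for line in lines
               for item in line.split()[1:])


def parse_line(line, num_features):
    # Parse "label idx:val ..." into (label, dense list of length num_features).
    # Raises IndexError if a feature index exceeds num_features.
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * num_features
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx) - 1] = float(val)
    return label, features


train_lines = ["1 1:0.5 4:2.0", "0 2:1.0"]
test_lines = ["1 1:0.3 2:0.7"]  # highest index seen here is 2, not 4

# Take the dimension from the training data and reuse it for the test data,
# instead of inferring a smaller dimension from the test file alone.
num_features = max_index(train_lines)
test_rows = [parse_line(line, num_features) for line in test_lines]
print(num_features, test_rows)
```

Inferring the dimension separately from each file would give the test vectors
a length of 2 here, which would not match a model trained on 4 features.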
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark dtrunner-update
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2785.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2785
----
commit 4e88c1f670ee43082e32a47f2bbdda279a385b41
Author: Joseph K. Bradley
Date: 2014-10-10T18:06:24Z
changed RF toString to print total number of nodes
commit ba567ab90618685df23dad0acf974d6d900a5027
Author: Joseph K. Bradley
Date: 2014-10-10T18:14:44Z
Changed DTRunner to load test data using same number of features as in
training data.
commit 7f3d60fb6e4ca9a4833c963a54082b068470f322
Author: Joseph K. Bradley
Date: 2014-10-11T00:28:22Z
Merge remote-tracking branch 'upstream/master' into dtrunner-update
commit f502e653cd4643600081da36f682a190dae2e3b4
Author: Joseph K. Bradley
Date: 2014-10-13T18:50:20Z
bug fix for SPARK-3934
----