GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/2785

    [SPARK-3934] [SPARK-3918] [mllib]  Bug fixes for RandomForest, DecisionTree

    SPARK-3934: RandomForest fails on multiclass classification when the data
mixes unordered categorical and continuous features. The bug is in the sanity
checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the wrong
indices when checking whether features are unordered.
    Fix: Remove the sanity checks. They are not strictly needed, and keeping
them would require DTStatsAggregator to track an extra set of indices (for the
feature subset).
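
To illustrate the kind of index mismatch described above, here is a minimal
sketch in plain Python. It is not Spark's code: the names `unordered_features`,
`feature_subset`, `is_unordered_buggy`, and `is_unordered_correct` are
hypothetical, and the direction of the mix-up (a position within the feature
subset being checked against global feature indices) is an assumption made for
the example.

```python
# Illustrative sketch only; all names are hypothetical, not Spark's API.

# Global indices of the unordered categorical features (assumed values).
unordered_features = {0, 3}

# A node trains on a random subset of features. Statistics are stored by
# position within this subset, not by global feature index.
feature_subset = [5, 3, 7]  # position 0 -> global feature 5, and so on


def is_unordered_buggy(subset_position):
    """Faulty check: treats a subset position as if it were a global index."""
    return subset_position in unordered_features


def is_unordered_correct(subset_position):
    """Maps the subset position back to the global index before checking."""
    return feature_subset[subset_position] in unordered_features


for pos, feat in enumerate(feature_subset):
    print(pos, feat, is_unordered_buggy(pos), is_unordered_correct(pos))
```

In this example the two checks disagree for positions 0 and 1. A correct check
needs the subset-to-global mapping, which is the extra set of indices the
description mentions.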
    
    Added a test to RandomForestSuite that failed with the old version and now passes.
    
    SPARK-3918: Added baggedInput.unpersist at end of training.
    
    Also:
    * I removed DTStatsAggregator.isUnordered since it is no longer used.
    * DecisionTreeMetadata: Added logWarning when maxBins is automatically 
reduced.
    * Updated DecisionTreeRunner to explicitly load the test data with the
same number of features as the training data.  This is a temporary fix that
should eventually be replaced by pre-indexing both datasets.
    * RandomForestModel: Updated toString to print total number of nodes in 
forest.
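
The DecisionTreeRunner item above can be illustrated with a small sketch in
plain Python. It does not use Spark and does not reproduce the runner's code:
`infer_num_features` and `parse_libsvm_line` are hypothetical helpers, and the
LIBSVM-style input lines are made up. The idea shown is that when the feature
count is inferred per file from the largest index seen, a test set can end up
with fewer features than the training set, so the training count is passed in
explicitly.

```python
# Illustrative sketch only; helper names and data are hypothetical.

def infer_num_features(lines):
    """Largest 1-based feature index seen in LIBSVM-style lines."""
    return max(
        (int(item.split(":")[0]) for line in lines for item in line.split()[1:]),
        default=0,
    )


def parse_libsvm_line(line, num_features):
    """Parse 'label idx:val ...' into (label, dense vector of num_features)."""
    parts = line.split()
    label = float(parts[0])
    vector = [0.0] * num_features
    for item in parts[1:]:
        idx, val = item.split(":")
        position = int(idx) - 1  # LIBSVM indices are 1-based
        if position >= num_features:
            raise ValueError(f"feature index {idx} exceeds {num_features}")
        vector[position] = float(val)
    return label, vector


train_lines = ["1 1:0.5 4:2.0", "0 2:1.0"]
test_lines = ["1 1:1.5 2:0.3"]  # never mentions features 3 or 4

num_features = infer_num_features(train_lines)

# Reuse the training feature count so test vectors have the same length.
test_points = [parse_libsvm_line(line, num_features) for line in test_lines]
print(num_features, infer_num_features(test_lines), test_points)
```

Inferring the count from the test lines alone would give a smaller dimension
than the training data has, which is the mismatch the temporary fix avoids.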


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dtrunner-update

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2785.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2785
    
----
commit 4e88c1f670ee43082e32a47f2bbdda279a385b41
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-10T18:06:24Z

    changed RF toString to print total number of nodes

commit ba567ab90618685df23dad0acf974d6d900a5027
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-10T18:14:44Z

    Changed DTRunner to load test data using same number of features as in 
training data.

commit 7f3d60fb6e4ca9a4833c963a54082b068470f322
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-11T00:28:22Z

    Merge remote-tracking branch 'upstream/master' into dtrunner-update

commit f502e653cd4643600081da36f682a190dae2e3b4
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-13T18:50:20Z

    bug fix for SPARK-3934

----

