GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/9474

    [SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees

    Decision trees in spark.ml (RandomForest.scala) communicate twice as much
    data as needed for unordered categorical features. Here's an example.
    
    Say there are 3 categories A, B, C. We consider 3 splits:
    
    * A vs. B, C
    * A, B vs. C
    * A, C vs. B
    
    Currently, we collect statistics for each of the 6 subsets of categories
    (3 splits * 2 sides = 6). However, we could instead collect statistics
    for only the 3 subsets on the left-hand side of the 3 possible splits:
    {A}, {A,B}, and {A,C}. If we also have stats for the entire node, then we
    can compute the stats for the 3 subsets on the right-hand side of the
    splits by subtraction. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).
    
    This patch adds a parent stats array to the `DTStatsAggregator` so that
    the right child stats do not need to be stored. For unordered categorical
    features, the right child stats are computed by subtracting the left
    child stats from the parent stats.
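
The subtraction idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in this patch: the function names `left_subsets` and `right_stats` are hypothetical, the statistics are assumed to be simple per-class label counts, and nothing here mirrors the actual `DTStatsAggregator` API.

```python
from itertools import combinations


def left_subsets(categories):
    """Enumerate the left-hand side of each distinct binary split.

    Pinning the first category to the left side avoids counting a split
    and its mirror image twice, giving 2^(k-1) - 1 splits for k categories.
    """
    first, rest = categories[0], categories[1:]
    subsets = []
    # r < len(rest) keeps at least one category on the right-hand side.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            subsets.append((first,) + extra)
    return subsets


def subset_stats(per_category, subset):
    """Sum the per-class counts (assumed stats format) over a subset."""
    num_classes = len(next(iter(per_category.values())))
    return [sum(per_category[c][i] for c in subset) for i in range(num_classes)]


def right_stats(parent, left):
    """Recover right-child stats as parent stats minus left-child stats."""
    return [p - l for p, l in zip(parent, left)]


# Hypothetical label counts [class 0, class 1] for each category at one node.
categories = ("A", "B", "C")
per_category = {"A": [3, 1], "B": [2, 2], "C": [0, 4]}

parent = subset_stats(per_category, categories)
for left in left_subsets(categories):
    right = tuple(c for c in categories if c not in left)
    derived = right_stats(parent, subset_stats(per_category, left))
    # The derived stats match what storing the right-hand bin would give.
    assert derived == subset_stats(per_category, right)
```

With 3 categories, only the 3 left-hand subsets plus one parent array are kept, rather than 6 subsets, which is where the roughly twofold saving for unordered features comes from.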

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark SPARK-10788

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9474.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9474
    
----
commit 19dc9360742cca88e110be1737fbb6b1b680d974
Author: sethah <[email protected]>
Date:   2015-10-02T15:53:56Z

    Removing superfluous bins in decision tree training

commit d19385180f054dae653173b2a87cdb659d942111
Author: sethah <[email protected]>
Date:   2015-11-02T18:15:41Z

    adding parent stats to aggregator

commit 9fb9558ac148081285fe3faf537f9fab6bef0d22
Author: sethah <[email protected]>
Date:   2015-11-02T20:48:29Z

    reverting to mllib based change

commit defda7c2e5b6a7d9cdfcc48b2ab9a0f465161768
Author: sethah <[email protected]>
Date:   2015-11-04T00:38:52Z

    style cleanup

commit a84c40b31bd5c78d400085b7aa1004314701be1d
Author: sethah <[email protected]>
Date:   2015-11-04T00:47:37Z

    changing scopes

commit e09de2d386999e50ff9349f206dd7916a90ab325
Author: sethah <[email protected]>
Date:   2015-11-04T19:25:18Z

    removing obsolete methods

commit 1c77427ca260cd0e3c472478372b6fe596cb6d9e
Author: sethah <[email protected]>
Date:   2015-11-04T21:44:14Z

    adding test for number of bins

commit 19baecbe9d9226a332319025ba5d05a2a3c285fa
Author: sethah <[email protected]>
Date:   2015-11-04T22:21:30Z

    clone parent stats in getImpurityCalculator

----


