GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1950

    [SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered 
feature bug fix

    
    DecisionTree improvements:
    (1) TreePoint representation to avoid binning multiple times
    (2) Bug fix: isSampleValid indexed bins incorrectly for unordered 
categorical features
    (3) Timing for DecisionTree internals
    
    Details:
    
    (1) TreePoint representation to avoid binning multiple times
    
    [https://issues.apache.org/jira/browse/SPARK-3022]
    
    Added private[tree] TreePoint class for representing binned feature values.
    
    The input RDD of LabeledPoint is converted to the TreePoint representation 
initially and then cached.  This avoids the previous problem of re-computing 
bins multiple times.
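
A minimal sketch of the idea, not the code in this PR: each example is binned once up front, and later passes read only the stored bin indices. `LabeledPointSketch`, `TreePointSketch`, `TreePointConversion`, and the linear-scan `findBin` are hypothetical stand-ins chosen so the example runs without Spark.

```scala
// Hypothetical stand-in for MLlib's LabeledPoint (label plus raw feature values).
final case class LabeledPointSketch(label: Double, features: Array[Double])

// Hypothetical binned form of one example: one bin index per feature.
final case class TreePointSketch(label: Double, binnedFeatures: Array[Int])

object TreePointConversion {
  // Assumed binning rule for a continuous feature: bin i holds values
  // <= thresholds(i); the last bin holds everything larger.
  // A linear scan is used here for clarity only.
  def findBin(value: Double, thresholds: Array[Double]): Int = {
    val idx = thresholds.indexWhere(t => value <= t)
    if (idx >= 0) idx else thresholds.length
  }

  // Bin every feature of one example exactly once.
  def fromLabeledPoint(
      point: LabeledPointSketch,
      thresholds: Array[Array[Double]]): TreePointSketch = {
    val bins = Array.tabulate(point.features.length) { f =>
      findBin(point.features(f), thresholds(f))
    }
    TreePointSketch(point.label, bins)
  }
}
```

With an RDD, the same conversion would be applied once through `map` and the result kept with `cache()`, so that each level of the tree reuses the binned values instead of recomputing them.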
    
    (2) Bug fix: isSampleValid indexed bins incorrectly for unordered 
categorical features
    
    [https://issues.apache.org/jira/browse/SPARK-3041]
    
    isSampleValid used to treat unordered categorical features incorrectly: it 
treated the bins as if indexed by feature values, rather than by subsets of 
values/categories.
    * The bug appeared only for unordered features (multi-class classification 
with categorical features of low arity)
    * Fix: Index bins correctly for unordered categorical features.
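
To illustrate why a bin index for an unordered feature is not a feature value, here is a small sketch under an assumed encoding: each split index is mapped to a bitmask that selects the subset of categories sent left. The object name, method names, and the bitmask scheme are hypothetical and may differ from what DecisionTree does internally.

```scala
object UnorderedSplitsSketch {
  // Number of distinct two-way partitions of `arity` categories,
  // excluding the empty/full partition and mirror images.
  def numSplits(arity: Int): Int = (1 << (arity - 1)) - 1

  // Categories sent left by the split at `splitIndex` (0-based).
  // Assumed encoding: bit c of (splitIndex + 1) is set iff category c goes left.
  def leftCategories(splitIndex: Int, arity: Int): Set[Int] = {
    val mask = splitIndex + 1
    (0 until arity).filter(c => ((mask >> c) & 1) == 1).toSet
  }

  // Correct test: look up membership in the subset the split index encodes.
  // Comparing `category` with `splitIndex` directly would be the kind of
  // mistake described above.
  def goesLeft(category: Int, splitIndex: Int, arity: Int): Boolean =
    leftCategories(splitIndex, arity).contains(category)
}
```

For a feature with 3 categories this encoding gives 3 splits, and split index 2 stands for the subset {0, 1}, not for category 2.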
    
    (3) Timing for DecisionTree internals
    
    Added a TimeTracker class (tree/impl/TimeTracker.scala), private[tree] for 
now, for timing key parts of the DecisionTree code.
    Timing info is printed via logDebug.
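
A rough sketch of what such a timer could look like, assuming a start/stop interface keyed by label and totals reported in seconds. The class name `TimeTrackerSketch` and its methods are illustrative guesses, not the actual TimeTracker API.

```scala
import scala.collection.mutable

// Hypothetical labeled stopwatch: accumulates elapsed time per label.
class TimeTrackerSketch {
  private val starts = mutable.HashMap.empty[String, Long]
  private val totals = mutable.HashMap.empty[String, Long]

  // Begin timing the section named `label`.
  def start(label: String): Unit = {
    require(!starts.contains(label), s"timer '$label' is already running")
    starts(label) = System.nanoTime()
  }

  // Stop timing `label`, add to its total, and return the elapsed seconds.
  def stop(label: String): Double = {
    val begun = starts.remove(label).getOrElse(
      throw new IllegalStateException(s"timer '$label' was not started"))
    val elapsed = System.nanoTime() - begun
    totals(label) = totals.getOrElse(label, 0L) + elapsed
    elapsed / 1e9
  }

  // One line per label with the accumulated time in seconds.
  override def toString: String =
    totals.toSeq.sortBy(_._1)
      .map { case (label, nanos) => s"  $label: ${nanos / 1e9}" }
      .mkString("\n")
}
```

A caller would wrap a section in `start("findBestSplits")` and `stop("findBestSplits")` (the label is made up) and pass `toString` to a debug-level logger once training finishes.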
    
    CC: @mengxr @manishamde @chouqin  Very similar update, with one bug fix.  
Many apologies for the conflicting update, but I hope that a few more 
optimizations I have on the way (which depend on this update) will prove 
valuable to you: SPARK-3042 and SPARK-3043

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-opt1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1950.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1950
    
----
commit a95bc22e648d01158d3a4fd597059135e1302266
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-05T18:17:28Z

    timing for DecisionTree internals

commit 511ec85fbe4c4463d8e600fabc5d54c5b2bd8417
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-06T01:16:19Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit bcf874a7444303ac7dc14cc5a36890cec45a8359
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-07T21:53:22Z

    Merge remote-tracking branch 'upstream/master' into dt-timing
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

commit f61e9d227233679ab826e38210376e7050da9b6b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T07:35:06Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit 3211f027c1a41f8eaa4eea4e90073216a8474c4e
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T16:46:12Z

    Optimizing DecisionTree
    * Added TreePoint representation to avoid calling findBin multiple times.
    * (not working yet, but debugging)

commit 0f676e2e0ae02e54387a255ac9f64d3c7265d152
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T21:12:52Z

    Optimizations + Bug fix for DecisionTree
    
    Optimization: Added TreePoint representation so we only call findBin once 
for each example, feature.
    
    Also, calculateGainsForAllNodeSplits now only searches over actual splits, 
not empty/unused ones.
    
    BUG FIX: isSampleValid
    * isSampleValid used to treat unordered categorical features incorrectly: 
It treated the bins as if indexed by feature values, rather than by subsets of 
values/categories.
    * exhibited for unordered features (multi-class classification with 
categorical features of low arity)
    * Fix: Index bins correctly for unordered categorical features.
    
    Also: some commented-out debugging println calls in DecisionTree, to be 
removed later

commit a87e08f1e5999c31b956a34617f88ff9a50775ae
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T18:34:12Z

    Merge remote-tracking branch 'upstream/master' into dt-opt1

commit 8464a6efd644daf9954ba43c9790ec304f94e029
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T19:26:57Z

    Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  
Removed debugging println calls from DecisionTree.  Made TreePoint extend 
Serializable

commit e66f1b1cb2252dab1f847f2c24623baab40627fc
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T19:58:22Z

    TreePoint
    * Updated doc
    * Made some methods private
    
    Changed timer to report time in seconds.

commit d03608949e19c53596b4f6cc09d9f68011184d68
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T20:07:14Z

    Print timing info to logDebug.

----


