GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1975

    [SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up

    DecisionTree needs to match each example to a node at each iteration.  It 
currently does this with a set of filters very inefficiently: For each example, 
it examines each node at the current level and traces up to the root to see if 
that example should be handled by that node.
    
    Fix: Filter top-down using the partly built tree itself.
    
    Major changes:
    * Eliminated Filter class, findBinsForLevel() method.
    * Set up node parent links in main loop over levels in train().
    * Added predictNodeIndex() for filtering top-down.
    
    Other changes:
    * Pre-compute set of unorderedFeatures.
    
    Notes for following expected PR based on 
[https://issues.apache.org/jira/browse/SPARK-3043]:
    * The unorderedFeatures set will next be stored in a metadata structure to 
simplify function calls (to store other items such as the data in strategy).
    
    I've done initial tests indicating that this speeds things up, but am only 
now running large-scale ones.
    
    CC: @mengxr @manishamde @chouqin  Any comments are welcome---thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-opt2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1975.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1975
    
----
commit a95bc22e648d01158d3a4fd597059135e1302266
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-05T18:17:28Z

    timing for DecisionTree internals

commit 511ec85fbe4c4463d8e600fabc5d54c5b2bd8417
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-06T01:16:19Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit bcf874a7444303ac7dc14cc5a36890cec45a8359
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-07T21:53:22Z

    Merge remote-tracking branch 'upstream/master' into dt-timing
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

commit f61e9d227233679ab826e38210376e7050da9b6b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T07:35:06Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit 3211f027c1a41f8eaa4eea4e90073216a8474c4e
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T16:46:12Z

    Optimizing DecisionTree
    * Added TreePoint representation to avoid calling findBin multiple times.
    * (not working yet, but debugging)

commit 0f676e2e0ae02e54387a255ac9f64d3c7265d152
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T21:12:52Z

    Optimizations + Bug fix for DecisionTree
    
    Optimization: Added TreePoint representation so we only call findBin once 
for each example, feature.
    
    Also, calculateGainsForAllNodeSplits now only searches over actual splits, 
not empty/unused ones.
    
    BUG FIX: isSampleValid
    * isSampleValid used to treat unordered categorical features incorrectly: 
It treated the bins as if indexed by featured values, rather than by subsets of 
values/categories.
    * exhibited for unordered features (multi-class classification with 
categorical features of low arity)
    * Fix: Index bins correctly for unordered categorical features.
    
    Also: some commented-out debugging println calls in DecisionTree, to be 
removed later

commit b2ed1f39ecc967a663a88241b46e5786eb66be22
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T21:15:44Z

    Merge remote-tracking branch 'upstream/master' into dt-opt

commit b914f3b7ed94e897b55f28c772f48a7d6fba7f06
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-09T19:01:45Z

    DecisionTree optimization: eliminated filters + small changes
    
    DecisionTree.scala
    * Eliminated filters, replaced by building tree on the fly and filtering 
top-down.
    ** Aggregation over examples now skips examples which do not reach the 
current level.
    * Only calculate unorderedFeatures once (in findSplitsBins)
    
    Node: Renamed predictIfLeaf to predict
    
    Bin, Split: Updated doc

commit c1565a5248e5d0ccc2293315799281030a74c217
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-11T18:09:32Z

    Small DecisionTree updates:
    * Simplification: Updated calculateGainForSplit to take aggregates for a 
single (feature, split) pair.
    * Internal doc: findAggForOrderedFeatureClassification

commit a87e08f1e5999c31b956a34617f88ff9a50775ae
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T18:34:12Z

    Merge remote-tracking branch 'upstream/master' into dt-opt1

commit 8464a6efd644daf9954ba43c9790ec304f94e029
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T19:26:57Z

    Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  
Removed debugging println calls from DecisionTree.  Made TreePoint extend 
Serialiable

commit e66f1b1cb2252dab1f847f2c24623baab40627fc
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T19:58:22Z

    TreePoint
    * Updated doc
    * Made some methods private
    
    Changed timer to report time in seconds.

commit d03608949e19c53596b4f6cc09d9f68011184d68
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T20:07:14Z

    Print timing info to logDebug.

commit 430d782294a08f63535e2ecce167703021e1fe44
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T23:09:14Z

    Added more debug info on binning error.  Added some docs.

commit 356dabac6bad8b2e2a9f7b90aaae80d987c113dc
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T23:57:13Z

    Merge branch 'dt-opt1' into dt-opt2
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
        mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

commit 26d10dd58ee218102bd205c1e6d68fda5a45cf4b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T00:44:08Z

    Removed tree/model/Filter.scala since no longer used.  Removed debugging 
println calls in DecisionTree.scala.

commit 2d2aaaffd630e5a9376a321ac5b7d2a64bcd13e2
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T16:46:16Z

    Merge remote-tracking branch 'upstream/master' into dt-opt1

commit 6b5651e7671315f78aef42344ab514e3cf8052df
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T19:28:47Z

    Updates based on code review.  1 major change: persisting to memory + disk, 
not just memory.
    
    Details:
    
    DecisionTree
    * Changed: .cache() -> .persist(StorageLevel.MEMORY_AND_DISK)
    ** This gave major performance improvements on small tests.  E.g., 500K 
examples, 500 features, depth 5, on MacBook, took 292 sec with cache() and 112 
when using disk as well.
    * Change for to while loops
    * Small cleanups
    
    TimeTracker
    * Removed useless timing in DecisionTree
    
    TreePoint
    * Renamed features to binnedFeatures

commit 5f2dec2e3c5e17ce79cef119ef039323dbd73942
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T19:43:50Z

    Fixed scalastyle issue in TreePoint

commit f40381c0ecf76506f7b727d1d6ca715fe7716065
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T21:28:07Z

    Merge branch 'dt-opt1' into dt-opt2
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
        mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
    
    Merge is OK except one DT Suite test to fix.

commit 797f68a13323bbebac099ea1a98654fb7b2984a0
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T22:19:01Z

    Fixed DecisionTreeSuite bug for training second level.  Needed to update 
treePointToNodeIndex with groupShift.

commit 931a3a714e44ab1138dac18bc3b497892d36e3e5
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T22:22:26Z

    Merge remote-tracking branch 'upstream/master' into dt-opt2

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to