GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/2125

    [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]  DecisionTree aggregation 
improvements

    Summary:
    (1) Variable numBins for each feature [SPARK-3043]
    (2) Reduced data reshaping in aggregation [SPARK-3043]
    (3) Choose ordering for ordered categorical features adaptively [SPARK-3156]
    (4) Changed nodes to use 1-indexing [SPARK-3086]
    (5) Small clean-ups
    
    Note: This PR looks bigger than it is since I moved several functions from 
inside findBestSplitsPerGroup to outside of it (to make it clear what was being 
serialized in the aggregation).
    
    Speedups: This update helps most when many features use few bins but a few 
features use many bins.  Some example results on speedups with 2M examples, 
3.5K features (15-worker EC2 cluster):
    * Example where old code was reasonably efficient (1/2 continuous, 1/4 
binary, 1/4 20-category): 164.813 --> 116.491 sec
    * Example where old code wasted many bins (1/10 continuous, 81/100 binary, 
9/100 20-category): 128.701 --> 39.334 sec
    
    Details:
    
    (1) Variable numBins for each feature [SPARK-3043]
    
    DecisionTreeMetadata now computes a variable numBins for each feature.  It 
also tracks numSplits.
    
    (2) Reduced data reshaping in aggregation [SPARK-3043]
    
    Added DTStatsAggregator, a wrapper around the aggregate statistics array 
for easy but efficient indexing.
    * Added ImpurityAggregator and ImpurityCalculator classes, to make 
DecisionTree code more oblivious to the type of impurity.
    
    The aggregate statistics are never reshaped, and cumulative sums are 
computed in-place.
    
    Updated the layout of aggregation functions.  The update simplifies things 
by (1) dividing features into ordered/unordered (instead of 
ordered/unordered/continuous) and (2) making use of the DTStatsAggregator for 
indexing.
    For this update, the following functions were refactored:
    * updateBinForOrderedFeature
    * updateBinForUnorderedFeature
    * binaryOrNotCategoricalBinSeqOp
    * multiclassWithCategoricalBinSeqOp
    * regressionBinSeqOp
    The above 5 functions were replaced with:
    * orderedBinSeqOp
    * someUnorderedBinSeqOp
    
    Other changes:
    * calculateGainForSplit now treats all feature types the same way.
    * Eliminated extractLeftRightNodeAggregates.
    
    (3) Choose ordering for ordered categorical features adaptively [SPARK-3156]
    
    Updated binsToBestSplit():
    * This now computes cumulative sums of stats for ordered features.
    * For ordered categorical features, it chooses an ordering for categories. 
(This uses to be done by findSplitsBins.)
    * Uses iterators to shorten code and avoid building an 
Array[Array[InformationGainStats]].
    
    Side effects:
    * In findSplitsBins: A sample of the data is only taken for data with 
continuous features.  It is not needed for data with only categorical features.
    * In findSplitsBins: splits and bins are no longer pre-computed for ordered 
categorical features since they are not needed.
    * TreePoint binning is simpler for categorical features.
    
    (4) Changed nodes to use 1-indexing [SPARK-3086]
    
    Nodes used to be indexed from 0.  Now they are indexed from 1.
    Node indexing functions are now collected in object Node (Node.scala).
    
    (5) Small clean-ups
    
    Eliminated functions extractNodeInfo() and extractInfoForLowerLevels() to 
reduce duplicate code.
    Eliminated InvalidBinIndex since it is no longer used.
    
    CC: @mengxr  @manishamde  Please let me know if you have thoughts on 
this—thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-opt3alt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2125
    
----
commit a95bc22e648d01158d3a4fd597059135e1302266
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-05T18:17:28Z

    timing for DecisionTree internals

commit 511ec85fbe4c4463d8e600fabc5d54c5b2bd8417
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-06T01:16:19Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit bcf874a7444303ac7dc14cc5a36890cec45a8359
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-07T21:53:22Z

    Merge remote-tracking branch 'upstream/master' into dt-timing
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

commit f61e9d227233679ab826e38210376e7050da9b6b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T07:35:06Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit 3211f027c1a41f8eaa4eea4e90073216a8474c4e
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T16:46:12Z

    Optimizing DecisionTree
    * Added TreePoint representation to avoid calling findBin multiple times.
    * (not working yet, but debugging)

commit 0f676e2e0ae02e54387a255ac9f64d3c7265d152
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T21:12:52Z

    Optimizations + Bug fix for DecisionTree
    
    Optimization: Added TreePoint representation so we only call findBin once 
for each example, feature.
    
    Also, calculateGainsForAllNodeSplits now only searches over actual splits, 
not empty/unused ones.
    
    BUG FIX: isSampleValid
    * isSampleValid used to treat unordered categorical features incorrectly: 
It treated the bins as if indexed by featured values, rather than by subsets of 
values/categories.
    * exhibited for unordered features (multi-class classification with 
categorical features of low arity)
    * Fix: Index bins correctly for unordered categorical features.
    
    Also: some commented-out debugging println calls in DecisionTree, to be 
removed later

commit b2ed1f39ecc967a663a88241b46e5786eb66be22
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T21:15:44Z

    Merge remote-tracking branch 'upstream/master' into dt-opt

commit b914f3b7ed94e897b55f28c772f48a7d6fba7f06
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-09T19:01:45Z

    DecisionTree optimization: eliminated filters + small changes
    
    DecisionTree.scala
    * Eliminated filters, replaced by building tree on the fly and filtering 
top-down.
    ** Aggregation over examples now skips examples which do not reach the 
current level.
    * Only calculate unorderedFeatures once (in findSplitsBins)
    
    Node: Renamed predictIfLeaf to predict
    
    Bin, Split: Updated doc

commit c1565a5248e5d0ccc2293315799281030a74c217
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-11T18:09:32Z

    Small DecisionTree updates:
    * Simplification: Updated calculateGainForSplit to take aggregates for a 
single (feature, split) pair.
    * Internal doc: findAggForOrderedFeatureClassification

commit fd653725dff2ad1de2aaf7eac0b06bbeee8d1129
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-13T04:13:09Z

    Major changes:
    * Created ImpurityAggregator classes, rather than old aggregates.
    * Feature split/bin semantics are based on ordered vs. unordered
    ** E.g.: numSplits = numBins for all unordered features, and numSplits = 
numBins - 1 for all ordered features.
    * numBins can differ for each feature
    
    DecisionTree
    * Major changes based on new aggregator setup
    ** For ordered features, aggregate is indexed by: 
(nodeIndex)(featureIndex)(binIndex).
    ** For unordered features, aggregate is indexed by: 
(nodeIndex)(featureIndex)(2 * binIndex),
    * Added LearningMetadata class
    * Eliminated now-unused functions:
    ** extractNodeInfo
    ** getNumSplitsForFeature
    ** getBinDataForNode (Eliminated since it merely slices/reshapes data.)
    
    ImpurityAggregator classes
    * Changed main aggregate operation to create binAggregates (binSeqOp, 
binCompOp) to use the aggregator.
    * Before, for unordered features, the left/right bins were treated as a 
separate dimension for aggregates.  They are now part of the bins: 
binAggregates is of size: (numNodes, numBins_f, numFeatures) where numBins_f is:
    ** 2 * [pow(2, maxFeatureValue - 1) - 1] for unordered categorical features
    ** maxFeatureValue for ordered categorical features
    ** maxBins for continuous features
    
    DecisionTreeSuite
    * For tests using unordered (low-arity) features, removed checks of 
Bin.category, which only has meaning for ordered features.

commit 51ef7813d9fb1c98457f015e1aa7dca92816750a
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-13T08:26:21Z

    Fixed bug introduced by last commit: Variance impurity calculation was 
incorrect since counts were swapped accidentally

commit e3c84ccf06f58fce235fb387c7fd0b432103e5a1
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T00:49:25Z

    Added stuff fro mnist8m to D T Runner

commit 86e217fb454e2834f92ecfbebd33419c886fe944
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T00:58:52Z

    added cache to DT input

commit 438a66018775dc928644d32e833aecd6c2265109
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T01:17:16Z

    removed subsampling for mnist8m from DT

commit dd4d3aa65e796d5b6ac36e36c0172cc90ad4ae15
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T18:27:34Z

    Mid-process in bug fix: bug for binary classification with categorical 
features
    * Bug: Categorical features were all treated as ordered for binary 
classification.  This is possible but would require the bin ordering to be 
determined on-the-fly after the aggregation.  Currently, the ordering is 
determined a priori and fixed for all splits.
    * (Temp) Fix: Treat low-arity categorical features as unordered for binary 
classification.
    * Related change: I removed most tests for isMulticlass in the code.  I 
instead test metadata for whether there are unordered features.
    * Status: The bug may be fixed, but more testing needs to be done.
    
    Aggregates: The same binMultiplier (for ordered vs. unordered) was applied 
to all features.  It is now applied on a per-feature basis.

commit a87e08f1e5999c31b956a34617f88ff9a50775ae
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T18:34:12Z

    Merge remote-tracking branch 'upstream/master' into dt-opt1

commit 8464a6efd644daf9954ba43c9790ec304f94e029
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T19:26:57Z

    Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  
Removed debugging println calls from DecisionTree.  Made TreePoint extend 
Serialiable

commit e66f1b1cb2252dab1f847f2c24623baab40627fc
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T19:58:22Z

    TreePoint
    * Updated doc
    * Made some methods private
    
    Changed timer to report time in seconds.

commit d03608949e19c53596b4f6cc09d9f68011184d68
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T20:07:14Z

    Print timing info to logDebug.

commit 430d782294a08f63535e2ecce167703021e1fe44
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T23:09:14Z

    Added more debug info on binning error.  Added some docs.

commit 356dabac6bad8b2e2a9f7b90aaae80d987c113dc
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T23:57:13Z

    Merge branch 'dt-opt1' into dt-opt2
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
        mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

commit 26d10dd58ee218102bd205c1e6d68fda5a45cf4b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T00:44:08Z

    Removed tree/model/Filter.scala since no longer used.  Removed debugging 
println calls in DecisionTree.scala.

commit 5fce6353a818087198307a3932ece044a09e45ab
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T16:45:28Z

    Merge branch 'dt-opt2' into dt-opt3
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
        mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala
        mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

commit 45f7ea7afdbe100b6a489da71805049a9d63995b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-17T17:17:18Z

    partial merge, not yet done

commit 9c833639ec935a1f372ea0655012259957d8778b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-18T03:23:17Z

    partial merge but not done yet

commit b3146594390cb4b6edd2b47e56611b7e42dc0f77
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-18T17:25:32Z

    Merge remote-tracking branch 'upstream/master' into dt-opt3

commit 3ba716667460310ef3bd897af9cb624a80e7fa94
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-18T20:25:47Z

    Merge remote-tracking branch 'upstream/master' into dt-opt3

commit 61c45093a9ae73b03e7a6737424101e45c5aa123
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-18T21:02:33Z

    Fixed bugs from merge: missing DT timer call, and numBins setting.  Cleaned 
up DT Suite some.

commit 5f94342bda903aac294ab76ab5ba89eb3751d3ab
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-18T22:32:04Z

    Added treeAggregate since not yet merged from master.  Moved node indexing 
functions to Node.

commit 95cad7cb16e659a58985230295f62a438d4a84ab
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-18T22:32:45Z

    Merge remote-tracking branch 'upstream/master' into dt-opt3
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to