GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/2015

    [mllib] DecisionTree: treeAggregate + Python example bug fix

    Small DecisionTree updates:
    * Changed main DecisionTree aggregate to treeAggregate.
    * Fixed bug in Python example decision_tree_runner.py with missing argument
      (since categoricalFeaturesInfo is no longer an optional argument for
      trainClassifier).
    
    CC: @mengxr
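
For readers unfamiliar with the pattern, the following is a minimal plain-Python sketch of what a tree-style aggregate does in general: each partition is folded locally, and the partial results are then merged in small groups over several levels instead of all at one point. It is an illustration only, not Spark's implementation; `tree_aggregate`, `partitions`, `seq_op`, `comb_op`, and `fan_in` are hypothetical names, and `zero` is assumed to be an immutable neutral element.

```python
from functools import reduce


def tree_aggregate(partitions, zero, seq_op, comb_op, fan_in=2):
    """Fold each partition locally, then merge partial results level by level."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    # Step 1: reduce every partition to a single partial result.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    if not partials:
        return zero
    # Step 2: merge groups of `fan_in` partial results until one remains.
    while len(partials) > 1:
        partials = [
            reduce(comb_op, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]


# Hypothetical data: four "partitions" holding the integers 1..10.
partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
total = tree_aggregate(
    partitions,
    0,
    lambda acc, x: acc + x,  # fold one element into a partial sum
    lambda a, b: a + b,      # merge two partial sums
)
print(total)
```

With many partitions and large per-partition aggregates, merging in stages like this keeps any single merge step from having to combine every partial result at once.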

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-opt2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2015.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2015
    
----
commit a95bc22e648d01158d3a4fd597059135e1302266
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-05T18:17:28Z

    timing for DecisionTree internals

commit 511ec85fbe4c4463d8e600fabc5d54c5b2bd8417
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-06T01:16:19Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit bcf874a7444303ac7dc14cc5a36890cec45a8359
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-07T21:53:22Z

    Merge remote-tracking branch 'upstream/master' into dt-timing
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

commit f61e9d227233679ab826e38210376e7050da9b6b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T07:35:06Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit 3211f027c1a41f8eaa4eea4e90073216a8474c4e
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T16:46:12Z

    Optimizing DecisionTree
    * Added TreePoint representation to avoid calling findBin multiple times.
    * (not working yet, but debugging)
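
The idea behind a precomputed point representation can be sketched in plain Python as follows. This is not the Scala code from the commit: the `TreePoint` class, `find_bin`, and `to_tree_point` below are stand-ins, and the sketch assumes continuous features whose bins are delimited by sorted thresholds.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Sequence, Tuple


def find_bin(value: float, thresholds: Sequence[float]) -> int:
    """Return the index of the bin that `value` falls into (binary search)."""
    return bisect_right(thresholds, value)


@dataclass(frozen=True)
class TreePoint:
    """An example whose raw feature values were replaced by bin indices."""
    label: float
    binned_features: Tuple[int, ...]


def to_tree_point(label, features, thresholds_per_feature) -> TreePoint:
    # The bin lookup happens exactly once per (example, feature) pair here,
    # so later passes over the data can reuse the stored indices.
    bins = tuple(
        find_bin(value, thresholds)
        for value, thresholds in zip(features, thresholds_per_feature)
    )
    return TreePoint(label, bins)


# Hypothetical thresholds: feature 0 has three bins, feature 1 has two.
thresholds_per_feature = [[0.5, 1.5], [10.0]]
point = to_tree_point(1.0, [1.2, 3.0], thresholds_per_feature)
print(point.binned_features)  # (1, 0)
```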

commit 0f676e2e0ae02e54387a255ac9f64d3c7265d152
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T21:12:52Z

    Optimizations + Bug fix for DecisionTree
    
    Optimization: Added TreePoint representation so we only call findBin once
    for each (example, feature) pair.
    
    Also, calculateGainsForAllNodeSplits now only searches over actual splits,
    not empty/unused ones.
    
    BUG FIX: isSampleValid
    * isSampleValid used to treat unordered categorical features incorrectly:
      It treated the bins as if indexed by feature values, rather than by
      subsets of values/categories.
    * The bug was exhibited for unordered features (multi-class classification
      with categorical features of low arity).
    * Fix: Index bins correctly for unordered categorical features.
    
    Also: some commented-out debugging println calls in DecisionTree, to be
    removed later
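
To make the distinction between indexing by feature value and indexing by subset concrete, here is a small plain-Python sketch of how the two-way splits of an unordered categorical feature can be enumerated as subsets of categories. It is an assumption-laden illustration, not the code from this commit; `unordered_splits` and `goes_left` are hypothetical helpers.

```python
def unordered_splits(arity):
    """Enumerate the distinct two-way partitions of categories 0..arity-1.

    Each split is represented by the set of categories sent to the left.
    The last category always stays on the right, so a partition and its
    mirror image are not both generated.
    """
    num_splits = (1 << (arity - 1)) - 1
    splits = []
    for mask in range(1, num_splits + 1):
        # Bit c of `mask` decides whether category c goes left.
        left = frozenset(c for c in range(arity - 1) if mask & (1 << c))
        splits.append(left)
    return splits


def goes_left(category, split):
    """An example goes left if its category belongs to the split's subset."""
    return category in split


splits = unordered_splits(3)
print(splits)
```

For a feature with 3 categories this yields 3 subsets, so the split index ranges over subsets rather than over the category values themselves.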

commit b2ed1f39ecc967a663a88241b46e5786eb66be22
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-08T21:15:44Z

    Merge remote-tracking branch 'upstream/master' into dt-opt

commit b914f3b7ed94e897b55f28c772f48a7d6fba7f06
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-09T19:01:45Z

    DecisionTree optimization: eliminated filters + small changes
    
    DecisionTree.scala
    * Eliminated filters, replaced by building tree on the fly and filtering
      top-down.
    ** Aggregation over examples now skips examples which do not reach the
       current level.
    * Only calculate unorderedFeatures once (in findSplitsBins)
    
    Node: Renamed predictIfLeaf to predict
    
    Bin, Split: Updated doc
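
Top-down filtering of this kind can be sketched in plain Python as below. The sketch assumes a binary tree stored by heap-style index (root 0, children 2i+1 and 2i+2) and splits expressed on binned features; `Node`, `max_left_bin`, and `node_at_level` are hypothetical names, not identifiers from the commit.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class Node:
    is_leaf: bool
    feature: int = -1       # index of the feature this node splits on
    max_left_bin: int = -1  # bins <= this value go to the left child


def node_at_level(
    binned_features: Sequence[int], nodes: Dict[int, Node], level: int
) -> Optional[int]:
    """Walk from the root and return the node index reached at `level`.

    Returns None when the example stops at a leaf above that level, in
    which case it can be skipped during aggregation for that level.
    """
    index = 0
    for _ in range(level):
        node = nodes[index]
        if node.is_leaf:
            return None
        go_left = binned_features[node.feature] <= node.max_left_bin
        index = 2 * index + (1 if go_left else 2)
    return index


# Hypothetical partially built tree: the root splits, its left child is a leaf.
nodes = {
    0: Node(is_leaf=False, feature=0, max_left_bin=1),
    1: Node(is_leaf=True),
    2: Node(is_leaf=False, feature=1, max_left_bin=0),
}
print(node_at_level([0, 3], nodes, 2))  # None: stopped at leaf 1
print(node_at_level([2, 0], nodes, 2))  # 5: right at the root, then left
```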

commit c1565a5248e5d0ccc2293315799281030a74c217
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-11T18:09:32Z

    Small DecisionTree updates:
    * Simplification: Updated calculateGainForSplit to take aggregates for a
      single (feature, split) pair.
    * Internal doc: findAggForOrderedFeatureClassification

commit a87e08f1e5999c31b956a34617f88ff9a50775ae
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T18:34:12Z

    Merge remote-tracking branch 'upstream/master' into dt-opt1

commit 8464a6efd644daf9954ba43c9790ec304f94e029
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T19:26:57Z

    Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.
    Removed debugging println calls from DecisionTree.  Made TreePoint extend
    Serializable

commit e66f1b1cb2252dab1f847f2c24623baab40627fc
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T19:58:22Z

    TreePoint
    * Updated doc
    * Made some methods private
    
    Changed timer to report time in seconds.
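
A timer that accumulates elapsed time per label and reports it in seconds might look like the following plain-Python sketch. It is a stand-in for illustration; the method names and report format are assumptions and are not taken from the Scala TimeTracker in this branch.

```python
import time
from collections import defaultdict


class TimeTracker:
    """Accumulates wall-clock time, in seconds, per named timer."""

    def __init__(self):
        self._starts = {}
        self._totals = defaultdict(float)

    def start(self, label):
        if label in self._starts:
            raise RuntimeError(f"timer {label!r} is already running")
        self._starts[label] = time.perf_counter()

    def stop(self, label):
        if label not in self._starts:
            raise RuntimeError(f"timer {label!r} was never started")
        elapsed = time.perf_counter() - self._starts.pop(label)
        self._totals[label] += elapsed
        return elapsed

    def total(self, label):
        return self._totals[label]

    def __str__(self):
        return "\n".join(
            f"  {label}: {seconds:.6f} sec"
            for label, seconds in sorted(self._totals.items())
        )


tracker = TimeTracker()
tracker.start("findSplitsBins")  # label chosen only for illustration
elapsed = tracker.stop("findSplitsBins")
print(tracker)
```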

commit d03608949e19c53596b4f6cc09d9f68011184d68
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T20:07:14Z

    Print timing info to logDebug.

commit 430d782294a08f63535e2ecce167703021e1fe44
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T23:09:14Z

    Added more debug info on binning error.  Added some docs.

commit 356dabac6bad8b2e2a9f7b90aaae80d987c113dc
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-14T23:57:13Z

    Merge branch 'dt-opt1' into dt-opt2
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
        mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

commit 26d10dd58ee218102bd205c1e6d68fda5a45cf4b
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T00:44:08Z

    Removed tree/model/Filter.scala since no longer used.  Removed debugging
    println calls in DecisionTree.scala.

commit 2d2aaaffd630e5a9376a321ac5b7d2a64bcd13e2
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T16:46:16Z

    Merge remote-tracking branch 'upstream/master' into dt-opt1

commit 6b5651e7671315f78aef42344ab514e3cf8052df
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T19:28:47Z

    Updates based on code review.  1 major change: persisting to memory + disk,
    not just memory.
    
    Details:
    
    DecisionTree
    * Changed: .cache() -> .persist(StorageLevel.MEMORY_AND_DISK)
    ** This gave major performance improvements on small tests.  E.g., 500K
       examples, 500 features, depth 5, on MacBook, took 292 sec with cache()
       and 112 sec when using disk as well.
    * Changed for loops to while loops
    * Small cleanups
    
    TimeTracker
    * Removed useless timing in DecisionTree
    
    TreePoint
    * Renamed features to binnedFeatures

commit 5f2dec2e3c5e17ce79cef119ef039323dbd73942
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T19:43:50Z

    Fixed scalastyle issue in TreePoint

commit f40381c0ecf76506f7b727d1d6ca715fe7716065
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T21:28:07Z

    Merge branch 'dt-opt1' into dt-opt2
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
        mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
    
    Merge is OK except one DT Suite test to fix.

commit 797f68a13323bbebac099ea1a98654fb7b2984a0
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T22:19:01Z

    Fixed DecisionTreeSuite bug for training second level.  Needed to update
    treePointToNodeIndex with groupShift.

commit 931a3a714e44ab1138dac18bc3b497892d36e3e5
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-15T22:22:26Z

    Merge remote-tracking branch 'upstream/master' into dt-opt2

commit 6a38f48322b8dfbf4d866452c7952dae6d09397a
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-16T01:28:30Z

    Added DTMetadata class for cleaner code

commit db0d7732a8cece3dc0188923e6c2939c09ee686a
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-16T01:35:16Z

    scala style fix

commit ac0b9f84ededb9aaee477f439f711d9be8e890bd
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-16T04:53:38Z

    Small updates based on code review.
    Main change: Now using << instead of math.pow.
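
The difference between the two ways of computing a power of two is easy to see in a small plain-Python sketch. The helper names below are hypothetical, and using the result as a per-level node count is an assumption made for illustration; the commit itself concerns the Scala code.

```python
import math


def num_nodes_at_level_shift(level: int) -> int:
    """2**level via a bit shift: stays in integer arithmetic throughout."""
    return 1 << level


def num_nodes_at_level_pow(level: int) -> int:
    """2**level via math.pow: goes through floating point, then converts."""
    return int(math.pow(2, level))


for level in range(5):
    # Both forms agree on small levels; the shift avoids the float round trip.
    assert num_nodes_at_level_shift(level) == num_nodes_at_level_pow(level)

print(type(math.pow(2, 3)).__name__)  # float
print(type(1 << 3).__name__)          # int
```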

commit 3726d2003e681e569a6d1fbf4af65909500f1b80
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-17T03:55:32Z

    Small code improvements based on code review.

commit a0ed0daa4c3622e19626de7aa3b29e07c6015ff2
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-17T05:47:35Z

    Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.

commit 66d076f2a042fe21558fec022a389800d514b5d2
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-18T03:52:35Z

    Merge remote-tracking branch 'upstream/master' into dt-opt2

commit 85bbc1fa6f9813661998e4a051670a3e59e1f679
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-18T04:49:52Z

    Merge remote-tracking branch 'upstream/master' into dt-opt2

commit b7b2922b1bc4a6192b36d9a930df86b5d5d6d13f
Author: Joseph K. Bradley <[email protected]>
Date:   2014-08-18T04:51:04Z

    Fixed bug in Python example decision_tree_runner.py with missing argument.
    Changed main DecisionTree aggregate to treeAggregate.

----

