GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/2015
[mllib] DecisionTree: treeAggregate + Python example bug fix
Small DecisionTree updates:
* Changed main DecisionTree aggregate to treeAggregate.
* Fixed bug in python example decision_tree_runner.py with missing argument
(since categoricalFeaturesInfo is no longer an optional argument for
trainClassifier).
CC: @mengxr
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark dt-opt2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2015.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2015
----
commit a95bc22e648d01158d3a4fd597059135e1302266
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-05T18:17:28Z
timing for DecisionTree internals
commit 511ec85fbe4c4463d8e600fabc5d54c5b2bd8417
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-06T01:16:19Z
Merge remote-tracking branch 'upstream/master' into dt-timing
commit bcf874a7444303ac7dc14cc5a36890cec45a8359
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-07T21:53:22Z
Merge remote-tracking branch 'upstream/master' into dt-timing
Conflicts:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
commit f61e9d227233679ab826e38210376e7050da9b6b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-08T07:35:06Z
Merge remote-tracking branch 'upstream/master' into dt-timing
commit 3211f027c1a41f8eaa4eea4e90073216a8474c4e
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-08T16:46:12Z
Optimizing DecisionTree
* Added TreePoint representation to avoid calling findBin multiple times.
* (not working yet, but debugging)
commit 0f676e2e0ae02e54387a255ac9f64d3c7265d152
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-08T21:12:52Z
Optimizations + Bug fix for DecisionTree
Optimization: Added TreePoint representation so we only call findBin once
for each example, feature.
Also, calculateGainsForAllNodeSplits now only searches over actual splits,
not empty/unused ones.
BUG FIX: isSampleValid
* isSampleValid used to treat unordered categorical features incorrectly:
It treated the bins as if indexed by featured values, rather than by subsets of
values/categories.
* exhibited for unordered features (multi-class classification with
categorical features of low arity)
* Fix: Index bins correctly for unordered categorical features.
Also: some commented-out debugging println calls in DecisionTree, to be
removed later
commit b2ed1f39ecc967a663a88241b46e5786eb66be22
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-08T21:15:44Z
Merge remote-tracking branch 'upstream/master' into dt-opt
commit b914f3b7ed94e897b55f28c772f48a7d6fba7f06
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-09T19:01:45Z
DecisionTree optimization: eliminated filters + small changes
DecisionTree.scala
* Eliminated filters, replaced by building tree on the fly and filtering
top-down.
** Aggregation over examples now skips examples which do not reach the
current level.
* Only calculate unorderedFeatures once (in findSplitsBins)
Node: Renamed predictIfLeaf to predict
Bin, Split: Updated doc
commit c1565a5248e5d0ccc2293315799281030a74c217
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-11T18:09:32Z
Small DecisionTree updates:
* Simplification: Updated calculateGainForSplit to take aggregates for a
single (feature, split) pair.
* Internal doc: findAggForOrderedFeatureClassification
commit a87e08f1e5999c31b956a34617f88ff9a50775ae
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T18:34:12Z
Merge remote-tracking branch 'upstream/master' into dt-opt1
commit 8464a6efd644daf9954ba43c9790ec304f94e029
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T19:26:57Z
Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.
Removed debugging println calls from DecisionTree. Made TreePoint extend
Serialiable
commit e66f1b1cb2252dab1f847f2c24623baab40627fc
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T19:58:22Z
TreePoint
* Updated doc
* Made some methods private
Changed timer to report time in seconds.
commit d03608949e19c53596b4f6cc09d9f68011184d68
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T20:07:14Z
Print timing info to logDebug.
commit 430d782294a08f63535e2ecce167703021e1fe44
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T23:09:14Z
Added more debug info on binning error. Added some docs.
commit 356dabac6bad8b2e2a9f7b90aaae80d987c113dc
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T23:57:13Z
Merge branch 'dt-opt1' into dt-opt2
Conflicts:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
commit 26d10dd58ee218102bd205c1e6d68fda5a45cf4b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-15T00:44:08Z
Removed tree/model/Filter.scala since no longer used. Removed debugging
println calls in DecisionTree.scala.
commit 2d2aaaffd630e5a9376a321ac5b7d2a64bcd13e2
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-15T16:46:16Z
Merge remote-tracking branch 'upstream/master' into dt-opt1
commit 6b5651e7671315f78aef42344ab514e3cf8052df
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-15T19:28:47Z
Updates based on code review. 1 major change: persisting to memory + disk,
not just memory.
Details:
DecisionTree
* Changed: .cache() -> .persist(StorageLevel.MEMORY_AND_DISK)
** This gave major performance improvements on small tests. E.g., 500K
examples, 500 features, depth 5, on MacBook, took 292 sec with cache() and 112
when using disk as well.
* Change for to while loops
* Small cleanups
TimeTracker
* Removed useless timing in DecisionTree
TreePoint
* Renamed features to binnedFeatures
commit 5f2dec2e3c5e17ce79cef119ef039323dbd73942
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-15T19:43:50Z
Fixed scalastyle issue in TreePoint
commit f40381c0ecf76506f7b727d1d6ca715fe7716065
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-15T21:28:07Z
Merge branch 'dt-opt1' into dt-opt2
Conflicts:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
Merge is OK except one DT Suite test to fix.
commit 797f68a13323bbebac099ea1a98654fb7b2984a0
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-15T22:19:01Z
Fixed DecisionTreeSuite bug for training second level. Needed to update
treePointToNodeIndex with groupShift.
commit 931a3a714e44ab1138dac18bc3b497892d36e3e5
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-15T22:22:26Z
Merge remote-tracking branch 'upstream/master' into dt-opt2
commit 6a38f48322b8dfbf4d866452c7952dae6d09397a
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-16T01:28:30Z
Added DTMetadata class for cleaner code
commit db0d7732a8cece3dc0188923e6c2939c09ee686a
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-16T01:35:16Z
scala style fix
commit ac0b9f84ededb9aaee477f439f711d9be8e890bd
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-16T04:53:38Z
Small updates based on code review.
Main change: Now using << instead of math.pow.
commit 3726d2003e681e569a6d1fbf4af65909500f1b80
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-17T03:55:32Z
Small code improvements based on code review.
commit a0ed0daa4c3622e19626de7aa3b29e07c6015ff2
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-17T05:47:35Z
Renamed DTMetadata to DecisionTreeMetadata. Small doc updates.
commit 66d076f2a042fe21558fec022a389800d514b5d2
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-18T03:52:35Z
Merge remote-tracking branch 'upstream/master' into dt-opt2
commit 85bbc1fa6f9813661998e4a051670a3e59e1f679
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-18T04:49:52Z
Merge remote-tracking branch 'upstream/master' into dt-opt2
commit b7b2922b1bc4a6192b36d9a930df86b5d5d6d13f
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-18T04:51:04Z
Fixed bug in python example decision_tree_runner.py with missing argument.
Changed main DecisionTree aggregate to treeAggregate.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]