GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/1950
[SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered
feature bug fix
DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered
categorical features
(3) Timing for DecisionTree internals
Details:
(1) TreePoint representation to avoid binning multiple times
[https://issues.apache.org/jira/browse/SPARK-3022]
Added private[tree] TreePoint class for representing binned feature values.
The input RDD of LabeledPoint is converted to the TreePoint representation
initially and then cached. This avoids the previous problem of re-computing
bins multiple times.
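
As a rough illustration of the idea (not the code in this PR), the sketch below bins each feature value once and stores the resulting bin indices. The names LabeledPoint, TreePoint, findBin and toTreePoint here are simplified, hypothetical stand-ins written in plain Scala without Spark, and only continuous features with ascending thresholds are handled.

```scala
// Hypothetical, simplified stand-ins; these are NOT the actual MLlib classes.
final case class LabeledPoint(label: Double, features: Array[Double])
final case class TreePoint(label: Double, binnedFeatures: Array[Int])

object TreePointSketch {
  /** Index of the bin containing `value`, given ascending split thresholds.
    * Bin i covers (thresholds(i-1), thresholds(i)]; the last bin is open-ended.
    * Assumption: continuous features only. */
  def findBin(value: Double, thresholds: Array[Double]): Int = {
    val idx = thresholds.indexWhere(value <= _)
    if (idx == -1) thresholds.length else idx
  }

  /** Bin every feature of one example exactly once. */
  def toTreePoint(p: LabeledPoint, thresholds: Array[Array[Double]]): TreePoint =
    TreePoint(
      p.label,
      p.features.zipWithIndex.map { case (v, f) => findBin(v, thresholds(f)) })
}
```

With an RDD of labeled points, one would presumably apply such a conversion with map and then call cache() on the result, so that later passes over the data read bin indices instead of recomputing them.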
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered
categorical features
[https://issues.apache.org/jira/browse/SPARK-3041]
isSampleValid used to treat unordered categorical features incorrectly: it
treated the bins as if indexed by feature values, rather than by subsets of
values/categories.
* The bug appeared for unordered features (multi-class classification with
categorical features of low arity).
* Fix: Index bins correctly for unordered categorical features.
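
To make the distinction concrete, here is a minimal sketch of the two indexing schemes. It assumes a bitmask encoding (bit c of the bin index set means category c belongs to the subset), which is chosen only for illustration and may differ from how DecisionTree actually encodes subsets. The function names are hypothetical, and inBinByValue only mimics the kind of mistake described above.

```scala
object UnorderedBinsSketch {
  /** Categories in the subset encoded by `binIndex`.
    * Assumed encoding: bit c set => category c is in the subset. */
  def categoriesInBin(binIndex: Int, arity: Int): Set[Int] =
    (0 until arity).filter(c => ((binIndex >> c) & 1) == 1).toSet

  /** Subset-based check: is the example's category a member of the bin's subset? */
  def inBin(category: Int, binIndex: Int, arity: Int): Boolean =
    categoriesInBin(binIndex, arity).contains(category)

  /** Value-based check (the mistaken pattern): treats the bin index as if it
    * were a single feature value. */
  def inBinByValue(category: Int, binIndex: Int): Boolean =
    category == binIndex
}
```

Under this encoding, bin index 5 for a feature of arity 3 stands for the subset {0, 2}, so an example with category 2 belongs to that bin even though 2 != 5.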
(3) Timing for DecisionTree internals
Added a TimeTracker class (tree/impl/TimeTracker.scala), private[tree] for
now, for timing key parts of the DecisionTree code.
Timing info is printed via logDebug.
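
For readers unfamiliar with this kind of helper, a labeled timer could look roughly like the sketch below. SimpleTimeTracker and its methods are hypothetical names, not the TimeTracker API from this PR; the sketch accumulates elapsed time per label and reports it in seconds.

```scala
import scala.collection.mutable

// Hypothetical sketch of a labeled timer; NOT the TimeTracker class from this PR.
class SimpleTimeTracker {
  private val starts = mutable.HashMap.empty[String, Long]
  private val totals = mutable.HashMap.empty[String, Long]

  /** Start the timer for `label`; fails if it is already running. */
  def start(label: String): Unit = {
    require(!starts.contains(label), s"timer '$label' already started")
    starts(label) = System.nanoTime()
  }

  /** Stop the timer for `label`, add to its total, and return elapsed seconds. */
  def stop(label: String): Double = {
    val begin = starts.remove(label).getOrElse(
      throw new IllegalStateException(s"timer '$label' was never started"))
    val elapsed = System.nanoTime() - begin
    totals(label) = totals.getOrElse(label, 0L) + elapsed
    elapsed / 1e9
  }

  /** One line per label with its accumulated time in seconds. */
  override def toString: String =
    totals.toSeq.sortBy(_._1)
      .map { case (label, nanos) => s"  $label: ${nanos / 1e9}" }
      .mkString("\n")
}
```

A caller would wrap a section of training code in start("someLabel") and stop("someLabel") and pass the tracker's toString to a debug-level log call.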
CC: @mengxr @manishamde @chouqin Very similar update, with one bug fix.
Many apologies for the conflicting update, but I hope that a few more
optimizations I have on the way (which depend on this update) will prove
valuable to you: SPARK-3042 and SPARK-3043
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark dt-opt1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1950.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1950
----
commit a95bc22e648d01158d3a4fd597059135e1302266
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-05T18:17:28Z
timing for DecisionTree internals
commit 511ec85fbe4c4463d8e600fabc5d54c5b2bd8417
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-06T01:16:19Z
Merge remote-tracking branch 'upstream/master' into dt-timing
commit bcf874a7444303ac7dc14cc5a36890cec45a8359
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-07T21:53:22Z
Merge remote-tracking branch 'upstream/master' into dt-timing
Conflicts:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
commit f61e9d227233679ab826e38210376e7050da9b6b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-08T07:35:06Z
Merge remote-tracking branch 'upstream/master' into dt-timing
commit 3211f027c1a41f8eaa4eea4e90073216a8474c4e
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-08T16:46:12Z
Optimizing DecisionTree
* Added TreePoint representation to avoid calling findBin multiple times.
* (not working yet, but debugging)
commit 0f676e2e0ae02e54387a255ac9f64d3c7265d152
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-08T21:12:52Z
Optimizations + Bug fix for DecisionTree
Optimization: Added TreePoint representation so we only call findBin once
for each example, feature.
Also, calculateGainsForAllNodeSplits now only searches over actual splits,
not empty/unused ones.
BUG FIX: isSampleValid
* isSampleValid used to treat unordered categorical features incorrectly:
It treated the bins as if indexed by feature values, rather than by subsets of
values/categories.
* exhibited for unordered features (multi-class classification with
categorical features of low arity)
* Fix: Index bins correctly for unordered categorical features.
Also: some commented-out debugging println calls in DecisionTree, to be
removed later
commit a87e08f1e5999c31b956a34617f88ff9a50775ae
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T18:34:12Z
Merge remote-tracking branch 'upstream/master' into dt-opt1
commit 8464a6efd644daf9954ba43c9790ec304f94e029
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T19:26:57Z
Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.
Removed debugging println calls from DecisionTree. Made TreePoint extend
Serializable
commit e66f1b1cb2252dab1f847f2c24623baab40627fc
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T19:58:22Z
TreePoint
* Updated doc
* Made some methods private
Changed timer to report time in seconds.
commit d03608949e19c53596b4f6cc09d9f68011184d68
Author: Joseph K. Bradley <[email protected]>
Date: 2014-08-14T20:07:14Z
Print timing info to logDebug.
----