GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19666
[SPARK-22451][ML] Reduce decision tree aggregate size for unordered
features from O(2^numCategories) to O(numCategories)
## What changes were proposed in this pull request?
We do not need to generate all possible splits for unordered features before
aggregation:
- Change `mixedBinSeqOp` (which runs on the executors) so that each unordered
feature is aggregated the same way as an ordered feature, i.e. one statistic
per category. An unordered feature therefore only needs `O(numCategories)`
space in the aggregate (see the sketch after this list).
- After the driver receives the aggregate result, generate all possible split
combinations from the per-category statistics and compute the best split.
This reduces the decision tree aggregate size for each unordered feature
from `O(2^numCategories)` to `O(numCategories)`, where `numCategories` is the
arity of the unordered feature.
It also reduces the CPU cost on the executor side: the time complexity for
each unordered feature drops from `O(numPoints * 2^numCategories)` to
`O(numPoints)`. It does not increase the time complexity of computing the best
split for unordered features on the driver side.
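
For a rough sense of the saving, consider a hypothetical unordered feature with 10 categories (illustrative numbers only, not measured from this PR):

```scala
// Approximate per-(node, feature) aggregate size for a 10-category unordered feature.
val numCategories = 10
val oldAggSize = 1 << numCategories   // 1024 stat slots: O(2^numCategories)
val newAggSize = numCategories        // 10 stat slots: O(numCategories)
```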
## How was this patch tested?
Unit tests added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark improve_decision_tree
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19666.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19666
----
commit 93c7e0fdf8a16a4e14adfe1891b0a4dc018f6de8
Author: WeichenXu <[email protected]>
Date: 2017-11-05T13:35:45Z
init pr
commit e79abfd2a231ff29abc4be1723002542c85189df
Author: WeichenXu <[email protected]>
Date: 2017-11-06T09:41:04Z
add unit test
----