Weichen Xu created SPARK-22451:
----------------------------------
Summary: Reduce decision tree aggregate size for unordered
features from O(2^numCategories) to O(numCategories)
Key: SPARK-22451
URL: https://issues.apache.org/jira/browse/SPARK-22451
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.2.0
Reporter: Weichen Xu
Do not need generate all possible splits for unordered features before
aggregate,
in aggregete (executor side):
1. Change `mixedBinSeqOp`, for each unordered feature, we do the same stat with
ordered features. so for unordered features, we only need O(numCategories)
space for this feature stat.
2. After driver side get the aggregate result, generate all possible split
combinations, and compute the best split.
This will reduce decision tree aggregate size for each unordered feature from
O(2^numCategories) to O(numCategories), `numCategories` is the arity of this
unordered feature.
This also reduce the cpu cost in executor side. Reduce time complexity for this
unordered feature from O(numPoints * 2^numCategories) to O(numPoints).
This won't increase time complexity for unordered features best split computing
in driver side.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]