Weichen Xu created SPARK-22451:
----------------------------------

             Summary: Reduce decision tree aggregate size for unordered 
features from O(2^numCategories) to O(numCategories)
                 Key: SPARK-22451
                 URL: https://issues.apache.org/jira/browse/SPARK-22451
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Weichen Xu


Do not need generate all possible splits for unordered features before 
aggregate,
in aggregete (executor side):
1. Change `mixedBinSeqOp`, for each unordered feature, we do the same stat with 
ordered features. so for unordered features, we only need O(numCategories) 
space for this feature stat.
2. After driver side get the aggregate result, generate all possible split 
combinations, and compute the best split.

This will reduce decision tree aggregate size for each unordered feature from 
O(2^numCategories) to O(numCategories),  `numCategories` is the arity of this 
unordered feature.

This also reduce the cpu cost in executor side. Reduce time complexity for this 
unordered feature from O(numPoints * 2^numCategories) to O(numPoints).

This won't increase time complexity for unordered features best split computing 
in driver side.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to