GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/19666

    [SPARK-22451][ML] Reduce decision tree aggregate size for unordered 
features from O(2^numCategories) to O(numCategories)

    ## What changes were proposed in this pull request?
    
    We do not need generate all possible splits for unordered features before 
aggregate,
    
    - Change `mixedBinSeqOp` (which running on executor), for each unordered 
feature, we do the same stat with ordered features. so for unordered features, 
we only need O(numCategories) space for this feature stat.
    
    - After driver side get the aggregate result, generate all possible split 
combinations, and compute the best split.
    
    This will reduce decision tree aggregate size for each unordered feature 
from `O(2^numCategories)` to `O(numCategories)`, `numCategories` is the arity 
of this unordered feature.
    
    This also reduce the cpu cost in executor side. Reduce time complexity for 
this unordered feature from `O(numPoints * 2^numCategories)` to `O(numPoints)`.
    
    This won't increase time complexity for unordered features best split 
computing in driver side.
    
    ## How was this patch tested?
    
    UT added.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark improve_decision_tree

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19666.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19666
    
----
commit 93c7e0fdf8a16a4e14adfe1891b0a4dc018f6de8
Author: WeichenXu <[email protected]>
Date:   2017-11-05T13:35:45Z

    init pr

commit e79abfd2a231ff29abc4be1723002542c85189df
Author: WeichenXu <[email protected]>
Date:   2017-11-06T09:41:04Z

    add unit test

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to