GitHub user lucio-yz opened a pull request:

    https://github.com/apache/spark/pull/20472

    [SPARK-22751][ML] Improve ML RandomForest shuffle performance

    ## What changes were proposed in this pull request?
    
    As mentioned in
    [SPARK-22751](https://issues.apache.org/jira/browse/SPARK-22751?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20ML%20AND%20text%20~%20randomforest),
    there is a shuffle performance problem in ML RandomForest when training a
    random forest on high-dimensional data.
    
    The cause is that, in org.apache.spark.ml.tree.impl.RandomForest, the
    function findSplitsBySorting effectively flatMaps each sparse vector into a
    dense one, so the subsequent groupByKey produces a huge shuffle write.
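
    To illustrate the size of the blow-up, here is a minimal sketch using plain Scala collections rather than RDDs. `SparseRow`, `densePairs` and `storedEntries` are hypothetical names introduced for this example only; they are not part of Spark or of this patch.

```scala
// Hypothetical stand-in for a sparse feature vector: only non-zero entries are stored.
final case class SparseRow(size: Int, nonZeros: Map[Int, Double]) {
  def apply(i: Int): Double = nonZeros.getOrElse(i, 0.0)
}

object DenseExpansionSketch {
  // One (featureIndex, value) pair per feature and row, zeros included.
  // This mirrors the shape of data that would be handed to a groupByKey.
  def densePairs(rows: Seq[SparseRow]): Seq[(Int, Double)] =
    rows.flatMap(r => (0 until r.size).map(i => (i, r(i))))

  // Number of entries the sparse representation actually stores.
  def storedEntries(rows: Seq[SparseRow]): Int =
    rows.map(_.nonZeros.size).sum
}
```

    With two rows of 1000 features and three non-zero entries in total, the dense expansion emits 2000 pairs, although only three of them carry information.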
    
    To avoid this, we can add a filter after the flatMap to drop zero values.
    In findSplitsForContinuousFeature, the number of zero values can then be
    inferred by passing in a new parameter, numInput, which is the number of
    samples.
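
    The inference step can be sketched as follows, again with plain Scala collections. This is an illustration of the idea under stated assumptions, not the code in the patch: `valueCounts` is a hypothetical helper, and only the name `numInput` comes from the description above.

```scala
object ZeroCountSketch {
  // Given only the non-zero values of one feature and the total number of
  // samples (numInput), rebuild the per-value counts, including the zeros
  // that were filtered out before the shuffle.
  def valueCounts(nonZeroValues: Seq[Double], numInput: Int): Seq[(Double, Int)] = {
    val counted: Map[Double, Int] =
      nonZeroValues.groupBy(identity).map { case (v, vs) => (v, vs.size) }
    // Every sample that did not survive the filter must have been a zero.
    val numZeros = numInput - nonZeroValues.size
    val withZero = if (numZeros > 0) counted + (0.0 -> numZeros) else counted
    withZero.toSeq.sortWith(_._1 < _._1)
  }
}
```

    For example, with non-zero values 2.0, 1.0, 2.0 and numInput = 5, the two missing samples are counted as zeros, so split candidates can still be chosen from the full value distribution.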
    
    In addition, if a feature contains only zero values, continuousSplits will
    not have a key for that feature id, so I added a check wherever
    continuousSplits is used.
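
    One way such a check could look is sketched below. The map type is simplified to `Map[Int, Array[Double]]` for illustration (an assumption, not the type used in Spark), and `splitsFor` is a hypothetical helper.

```scala
object MissingSplitsSketch {
  // Returns the split thresholds for a feature, or an empty array when the
  // feature had only zero values and therefore never appeared as a key.
  def splitsFor(
      continuousSplits: Map[Int, Array[Double]],
      featureIndex: Int): Array[Double] =
    continuousSplits.getOrElse(featureIndex, Array.empty[Double])
}
```

    Falling back to an empty array avoids a NoSuchElementException on lookup; whether an empty result is the right fallback depends on how the caller treats features without splits.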
    
    ## How was this patch tested?
    Ran model locally using spark-submit.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lucio-yz/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20472
    
----
commit 50cb173dd34dc353c243b97f2686a8c545a03909
Author: lucio <576632108@...>
Date:   2018-02-01T09:47:52Z

    fix mllib randomforest shuffle issue

----


---
