GitHub user lucio-yz opened a pull request: https://github.com/apache/spark/pull/20472
[SPARK-22751][ML] Improve ML RandomForest shuffle performance

## What changes were proposed in this pull request?

As mentioned in [SPARK-22751](https://issues.apache.org/jira/browse/SPARK-22751?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20ML%20AND%20text%20~%20randomforest), ML RandomForest has a shuffle performance problem when training a random forest on high-dimensional data. The reason is that, in org.apache.spark.tree.impl.RandomForest, the function findSplitsBySorting flatMaps each sparse vector into dense (featureIndex, value) pairs, so the subsequent groupByKey produces a huge shuffle write. To avoid this, we can add a filter after the flatMap to drop zero values. In findSplitsForContinuousFeature, the number of zero values can then be inferred by passing a new parameter numInput (the number of samples) to the function. In addition, if a feature contains only zero values, continuousSplits will not contain that feature id as a key, so I added a check when looking up continuousSplits.

## How was this patch tested?

Ran the model locally using spark-submit.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lucio-yz/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20472.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20472

----

commit 50cb173dd34dc353c243b97f2686a8c545a03909
Author: lucio <576632108@...>
Date:   2018-02-01T09:47:52Z

    fix mllib randomforest shuffle issue

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
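The core idea of the patch can be sketched as follows. This is a minimal illustration, not the actual Spark internals: the names `collectNonZeroValues`, `rows`, and `numZeros`, and the simplified signatures, are assumptions made for the example.

```scala
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.rdd.RDD

// Sketch of the proposed fix (illustrative names, not the real internals):
// instead of flatMapping every (featureIndex, value) pair -- zeros included --
// emit only the non-zero entries, so the groupByKey shuffle stays small.
def collectNonZeroValues(rows: RDD[SparseVector]): RDD[(Int, Double)] =
  rows.flatMap { vec =>
    // A sparse vector stores only its explicitly-set entries; pair each
    // stored index with its value and filter out any explicit zeros.
    vec.indices.zip(vec.values).filter { case (_, v) => v != 0.0 }
  }

// When computing splits for one feature, the zeros that were filtered out
// no longer arrive through the shuffle, but their count can be inferred
// from the total sample count (the new numInput parameter):
def numZeros(nonZeroValues: Array[Double], numInput: Long): Long =
  numInput - nonZeroValues.length
```

For a feature whose column is entirely zero, `collectNonZeroValues` emits no pairs at all, which is why the result of the groupByKey may lack that feature id and the lookup into `continuousSplits` needs the extra existence check described above.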