GitHub user lucio-yz opened a pull request:
https://github.com/apache/spark/pull/20472
[SPARK-22751][ML]Improve ML RandomForest shuffle performance
## What changes were proposed in this pull request?
As I mentioned in
[SPARK-22751](https://issues.apache.org/jira/browse/SPARK-22751?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20ML%20AND%20text%20~%20randomforest),
there is a shuffle performance problem in ML RandomForest when training a
random forest on high-dimensional data.
The reason is that, in org.apache.spark.ml.tree.impl.RandomForest, the
function findSplitsBySorting effectively flatMaps each sparse vector into a
dense one, so the subsequent groupByKey produces a huge shuffle write.
To avoid this, we can add a filter after the flatMap to drop zero values.
The zero values are still accounted for: in findSplitsForContinuousFeature,
we can infer the number of zeros by passing a new parameter, numInput (the
number of samples), to that function.
In addition, if a feature contains only zero values, continuousSplits will
not contain that feature id as a key, so I added a check wherever
continuousSplits is used.
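A minimal sketch of the idea, in plain Scala collections rather than the actual RDD code (the object and method names here are hypothetical, not the PR's): drop zero-valued (featureIndex, value) pairs before grouping, and recover the zero count per feature from the total sample count.

```scala
// Hypothetical sketch of the shuffle-reduction idea from this PR.
object ShuffleReductionSketch {
  // Simulated flatMap output over sparse rows (each row maps feature -> value),
  // with a filter that drops zero values so less data is grouped/shuffled.
  def nonZeroPairs(rows: Seq[Map[Int, Double]], numFeatures: Int): Seq[(Int, Double)] =
    rows.flatMap { row =>
      (0 until numFeatures).map(f => (f, row.getOrElse(f, 0.0)))
    }.filter { case (_, v) => v != 0.0 }

  // The zeros need not be shuffled: given numInput total samples, the number
  // of zeros for a feature is numInput minus the non-zero values observed.
  def inferredZeroCount(numInput: Int, nonZeroValues: Seq[Double]): Int =
    numInput - nonZeroValues.length
}
```

The split-finding logic can then treat the inferred zero count as if that many zero-valued samples had been collected, which is why numInput must be threaded through to findSplitsForContinuousFeature.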
## How was this patch tested?
Ran the model locally using spark-submit.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/lucio-yz/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20472.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20472
----
commit 50cb173dd34dc353c243b97f2686a8c545a03909
Author: lucio <576632108@...>
Date: 2018-02-01T09:47:52Z
fix mllib randomforest shuffle issue
----
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]