GitHub user SaintBacchus opened a pull request:
https://github.com/apache/spark/pull/5152
[SPARK-6464][Core]Add a function named 'processCoalesce' in RDD to support
partition combination
Currently, the coalesce transformation is commonly used to increase or reduce
the number of partitions in order to improve performance.
But coalesce cannot guarantee that a child partition will be executed on
the same executor as its parent partitions, which can lead to a large amount
of network transfer.
In some scenarios, such as with a small, cached RDD, we want to coalesce all
the partitions on the same executor into one partition and make sure the child
partition is executed on that executor. This avoids network transfer, reduces
task-scheduling overhead, and frees CPU cores for other jobs.
In this scenario, our performance improved by 20%.
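The grouping idea described above can be illustrated with a small standalone
sketch. It is not the code in this pull request and does not use the Spark
API; it only simulates, in plain Python, how parent partitions might be grouped
into one child partition per executor. The function name
group_partitions_by_executor, the executor ids, and the partition layout are
all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_partitions_by_executor(partition_locations: Dict[int, str]) -> Dict[str, List[int]]:
    """Group parent partition indices by the executor that holds them.

    partition_locations maps a parent partition index to a (hypothetical)
    executor id. Each returned entry stands for one child partition that
    would be computed on that executor, so no data leaves the executor.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for partition_id in sorted(partition_locations):
        groups[partition_locations[partition_id]].append(partition_id)
    return dict(groups)


# Hypothetical layout: six cached partitions spread over three executors.
locations = {0: "exec-1", 1: "exec-2", 2: "exec-1", 3: "exec-3", 4: "exec-2", 5: "exec-1"}
child_partitions = group_partitions_by_executor(locations)

for executor, parents in child_partitions.items():
    print(executor, parents)
```

Under this assumed layout, six parent partitions collapse into three child
partitions, one per executor, so the number of tasks drops from six to three
and each task reads only locally cached data.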
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/SaintBacchus/spark ProcessCoalesce
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5152.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5152
----
commit 2cc3daacb8caf45f71e37d4515fa226f87bfbabe
Author: huangzhaowei <[email protected]>
Date: 2015-03-24T02:19:32Z
Add a function named 'processCoalesce' in RDD to support partition
combination
----