[GitHub] spark pull request: [SPARK-6464][Core]Add a function named 'proces...

SaintBacchus Mon, 23 Mar 2015 20:33:00 -0700

GitHub user SaintBacchus opened a pull request:

    https://github.com/apache/spark/pull/5152


    [SPARK-6464][Core]Add a function named 'processCoalesce' in RDD to support 
partition combination

    The coalesce transformation is commonly used to increase or reduce the 
number of partitions of an RDD in order to improve performance.
    However, coalesce cannot guarantee that a child partition is executed on 
the same executor as its parent partitions, which can lead to a large amount 
of network transfer.
    In some scenarios, such as the small, cached RDD mentioned in the title, we 
want to coalesce all the partitions on the same executor into one partition and 
make sure the child partition is executed on that executor. This avoids 
network transfer, reduces task-scheduling overhead, and frees CPU cores for 
other jobs.
    In this scenario, our performance improved by 20% compared with before.
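
    To illustrate the idea, here is a minimal plain-Python sketch of grouping 
cached partitions by the executor that holds them and merging each group into 
a single partition. It is only a model of the behaviour described above, not 
the code in this patch and not Spark API: the function names, the executor ids, 
and the `locations` mapping are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_partitions_by_executor(locations: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each executor id to the sorted parent partition indices cached on it.

    `locations` (partition index -> executor id) is an assumed input; in a real
    cluster this information would come from the block manager.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for partition, executor in locations.items():
        groups[executor].append(partition)
    return {executor: sorted(parts) for executor, parts in sorted(groups.items())}


def coalesce_per_executor(
    partitions: Sequence[Sequence[int]], locations: Dict[int, str]
) -> List[Tuple[str, List[int]]]:
    """Merge all partitions held by the same executor into one child partition.

    Each child partition is paired with the executor it should run on, so no
    rows would need to cross the network.
    """
    merged = []
    for executor, parts in group_partitions_by_executor(locations).items():
        rows = [row for p in parts for row in partitions[p]]
        merged.append((executor, rows))
    return merged


# Four small cached partitions spread over two hypothetical executors.
partitions = [[1, 2], [3], [4, 5], [6]]
locations = {0: "exec-1", 1: "exec-2", 2: "exec-1", 3: "exec-2"}
result = coalesce_per_executor(partitions, locations)
```

    In this model, four parent partitions become two child partitions, one per 
executor, so only two tasks would need to be scheduled instead of four.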

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/SaintBacchus/spark ProcessCoalesce

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5152.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5152
    
----
commit 2cc3daacb8caf45f71e37d4515fa226f87bfbabe
Author: huangzhaowei <[email protected]>
Date:   2015-03-24T02:19:32Z

    Add a function named 'processCoalesce' in RDD to support partition 
combination

----



