[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...

mengxr Wed, 23 Jul 2014 19:53:24 -0700

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/1562


    [SPARK-2568] RangePartitioner should run only one job if data is balanced

    As of Spark 1.0, `RangePartitioner` goes through the data twice: once to
    compute the count and once to do sampling. As a result, `sortByKey` goes
    through the data three times (once to count, once to sample, and once to
    sort).

    `RangePartitioner` should go through the data only once, collecting samples
    from the input partitions while counting. If the data is balanced, this
    should give us a good sketch. If we see big partitions, we re-sample from
    them in order to collect enough items.
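
The single-pass idea described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this pull request: the function names (`reservoir_sample_and_count`, `sketch`, `find_imbalanced`), the parameters, and the imbalance threshold are all assumptions made for the example. It uses reservoir sampling (named in the commit list below) to take a fixed-size sample and an exact count of each partition in one pass, then flags partitions that look too large for their sample so that they could be re-sampled.

```python
import random


def reservoir_sample_and_count(items, k, rng):
    """Return (sample, count) after one pass over items.

    The sample holds at most k items chosen uniformly at random.
    """
    reservoir = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / count.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count


def sketch(partitions, sample_size_per_partition, seed=0):
    """Sample and count every partition in a single pass over the data."""
    result = []
    for idx, part in enumerate(partitions):
        # Hypothetical seeding scheme: one generator per partition.
        rng = random.Random(seed + idx)
        sample, n = reservoir_sample_and_count(
            part, sample_size_per_partition, rng)
        result.append((idx, n, sample))
    return result


def find_imbalanced(sketched, total_sample_size, sample_size_per_partition):
    """Return indices of partitions whose fair share of the total sample
    exceeds what one reservoir can hold (assumed criterion)."""
    total = sum(n for _, n, _ in sketched)
    fraction = min(total_sample_size / max(total, 1), 1.0)
    return [idx for idx, n, _ in sketched
            if fraction * n > sample_size_per_partition]
```

With roughly balanced partitions, `find_imbalanced` returns an empty list and no second job would be needed; a partition that is much larger than the others is reported so that only it is re-sampled.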


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark range-partitioner

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1562.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1562
    
----
commit 69400105159f64f0672da896c313e2a22525d219
Author: Reynold Xin <[email protected]>
Date:   2014-07-18T05:00:13Z

    Reservoir sampling implementation.

commit badf20ded132d985f6d12000a876316af7287877
Author: Reynold Xin <[email protected]>
Date:   2014-07-18T05:29:53Z

    Renamed the method.

commit 17bcbf3982fabc027900c9ce791ae3233ba66700
Author: Reynold Xin <[email protected]>
Date:   2014-07-18T07:39:23Z

    Added seed.

commit 06ac2ec9037cc1a85a0e2fcdb2701296849bdbae
Author: Xiangrui Meng <[email protected]>
Date:   2014-07-21T21:25:53Z

    Merge remote-tracking branch 'apache/master' into range-part

commit cc12f47f670aa06c06b6309473b6441e989012dc
Author: Xiangrui Meng <[email protected]>
Date:   2014-07-22T08:02:28Z

    Merge remote-tracking branch 'apache/master' into range-part

commit 9ee9992f8581557ca410cf38a88557d3fd3fe21a
Author: Xiangrui Meng <[email protected]>
Date:   2014-07-23T18:16:01Z

    update range partitioner to run only one job on roughly balanced data

commit 60be09e9e1e8f9fa7ebb039fa11d925bbce48a08
Author: Xiangrui Meng <[email protected]>
Date:   2014-07-24T02:41:33Z

    remove importance sampler

----


