GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/1048

    [SPARK-2042] Prevent unnecessary shuffle triggered by take()

    This PR implements `take()` on a `SchemaRDD` by inserting a logical limit 
followed by a `collect()`. It also adds a Catalyst optimizer rule that 
collapses adjacent limits. Together, these changes prevent an unnecessary 
shuffle that `take()` sometimes triggers.
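
As a rough illustration of the two ideas described above, here is a minimal, self-contained sketch. It is a toy model, not the code in this PR: `Plan`, `Scan`, `Limit`, `LimitSketch`, `combineLimits`, `collect` and `take` are hypothetical names and do not correspond to Catalyst's classes or to `SchemaRDD`'s API. It assumes that two adjacent limits can be folded into one limit carrying the smaller of the two counts.

```scala
// Toy logical plan; these types are illustrative stand-ins, not Catalyst's.
sealed trait Plan
case class Scan(rows: Seq[Int]) extends Plan
case class Limit(n: Int, child: Plan) extends Plan

object LimitSketch {
  // Fold Limit(a, Limit(b, child)) into Limit(min(a, b), child), bottom-up.
  def combineLimits(plan: Plan): Plan = plan match {
    case Limit(outer, child) =>
      combineLimits(child) match {
        case Limit(inner, grandChild) => Limit(math.min(outer, inner), grandChild)
        case other                    => Limit(outer, other)
      }
    case scan: Scan => scan
  }

  // Evaluate a plan eagerly; stands in for collect() in this sketch.
  def collect(plan: Plan): Seq[Int] = plan match {
    case Scan(rows)      => rows
    case Limit(n, child) => collect(child).take(n)
  }

  // take(n) expressed as a logical limit followed by collect().
  def take(plan: Plan, n: Int): Seq[Int] =
    collect(combineLimits(Limit(n, plan)))
}
```

In this model, calling `LimitSketch.take` on a plan that already ends in a limit wraps it in a second `Limit`, and `combineLimits` reduces the pair to a single node before evaluation.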

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1048.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1048
    
----
commit 8d42d0308338d0b584a2a1c1e5e89d6ee2c18938
Author: Sameer Agarwal <[email protected]>
Date:   2014-06-11T00:22:57Z

    Implement trigger() as limit() followed by collect()

commit a0ff7c45d2d92367a36365c59377c3c7e2e730d2
Author: Sameer Agarwal <[email protected]>
Date:   2014-06-11T00:26:11Z

    Adding catalyst rule to fold two consecutive limits
    
    Creating a LimitFolding Batch

commit 8f946a2002b10f0531ccfa5accd5b683a5e98808
Author: Sameer Agarwal <[email protected]>
Date:   2014-06-11T06:09:50Z

    Added limit folding tests
    
    Adding tests
    
    Refactoring

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
