GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/16633

    [SPARK-19274][SQL] Make GlobalLimit without shuffling data to single 
partition

    ## What changes were proposed in this pull request?
    
    A logical `Limit` is actually executed as two physical operations, 
`LocalLimit` and `GlobalLimit`.
    
    In most cases, a shuffle exchange precedes `GlobalLimit` and moves the 
data to a single partition. When the limit is not trivially small, this 
shuffle is costly.
    
    This change performs `GlobalLimit` without shuffling data to a single 
partition. The approach is similar to `SparkPlan.executeTake`: it scans a 
subset of the partitions at a time until it has collected enough rows.
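    
    As a rough illustration of the incremental scan described above, here is 
a minimal Python sketch. It is not the code in this patch and not Spark's 
API: `take_incrementally`, the `scale_up` parameter, and the 
list-of-lists stand-in for partitions are all hypothetical, chosen only to 
show the idea of reading partitions in growing batches and stopping once 
enough rows are collected.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def take_incrementally(
    partitions: Sequence[List[T]], n: int, scale_up: int = 4
) -> Tuple[List[T], int]:
    """Collect up to n rows by scanning partitions in growing batches.

    Returns the collected rows and the number of partitions scanned.
    `partitions` is a hypothetical in-memory stand-in for distributed data.
    """
    rows: List[T] = []
    scanned = 0
    batch = 1  # start with a single partition, then widen the scan
    while len(rows) < n and scanned < len(partitions):
        remaining = n - len(rows)
        end = min(scanned + batch, len(partitions))
        for part in partitions[scanned:end]:
            # Local limit: no partition contributes more rows than are still needed.
            rows.extend(part[:remaining])
        scanned = end
        batch *= scale_up  # assumed growth factor, for illustration only
    # Global limit: trim any surplus gathered in the last batch.
    return rows[:n], scanned
```

    With five small partitions and a limit of 2, the sketch reads only the 
first partition; a larger limit widens the scan to the remaining ones. No 
step moves every row into one place before the limit is applied.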
    
    ## How was this patch tested?
    
    Jenkins tests.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 globallimit-without-shuffle

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16633.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16633
    
----
commit b26488f77acf442db768b41f94bbda9773b523a2
Author: Liang-Chi Hsieh <[email protected]>
Date:   2017-01-18T08:21:17Z

    Make GlobalLimit without shuffling data to single partition.

----

