[jira] [Updated] (SPARK-19355) Use map output statistics to improve global limit's parallelism

Wenchen Fan (JIRA) Wed, 10 Oct 2018 09:42:35 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wenchen Fan updated SPARK-19355:
--------------------------------
    Fix Version/s:     (was: 2.4.0)

> Use map output statistics to improve global limit's parallelism
> ---------------------------------------------------------------
>
>                 Key: SPARK-19355
>                 URL: https://issues.apache.org/jira/browse/SPARK-19355
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>
> A logical Limit is actually performed by two physical operators, LocalLimit 
> and GlobalLimit.
> Most of the time, before GlobalLimit, we perform a shuffle exchange to 
> move the data to a single partition. When the limit number is very large, we 
> shuffle a lot of data to a single partition and significantly reduce 
> parallelism, in addition to paying the cost of shuffling.
> This change tries to perform GlobalLimit without shuffling data to a single 
> partition. Instead, we perform the map stage of the shuffle and collect 
> statistics on the number of rows in each partition. The shuffled data are 
> then all retrieved locally, without fetching from remote executors.
> Once we have the number of output rows in each partition, we take only the 
> required number of rows from the locally shuffled data.
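
The last step of the description, deciding how many rows to take from each partition once the per-partition row counts are known, can be illustrated with a small sketch. This is plain Python, not Spark code: the function name rows_to_take and the "fill partitions in order" strategy are assumptions made for illustration, and the actual patch may distribute the limit across partitions differently.

```python
from typing import List


def rows_to_take(partition_row_counts: List[int], limit: int) -> List[int]:
    """Hypothetical helper: given the number of rows each map output
    partition holds, return how many rows to read from each partition
    so that the total does not exceed `limit`.

    Assumed strategy: walk the partitions in order and take as many rows
    as possible from each until the limit is satisfied.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")

    remaining = limit
    takes = []
    for count in partition_row_counts:
        # Never take more than the partition holds or than is still needed.
        take = min(count, remaining)
        takes.append(take)
        remaining -= take
    return takes


if __name__ == "__main__":
    # Four partitions with known row counts and a global limit of 60.
    print(rows_to_take([40, 10, 0, 70], 60))  # [40, 10, 0, 10]
```

Each partition can then apply its own local take in parallel, so no single partition has to receive all of the limited rows.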



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

