[
https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan updated SPARK-19355:
--------------------------------
Fix Version/s: (was: 2.4.0)
> Use map output statistices to improve global limit's parallelism
> ----------------------------------------------------------------
>
> Key: SPARK-19355
> URL: https://issues.apache.org/jira/browse/SPARK-19355
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Liang-Chi Hsieh
> Assignee: Liang-Chi Hsieh
> Priority: Major
>
> A logical Limit is performed actually by two physical operations LocalLimit
> and GlobalLimit.
> In most of time, before GlobalLimit, we will perform a shuffle exchange to
> shuffle data to single partition. When the limit number is very big, we
> shuffle a lot of data to a single partition and significantly reduce
> parallelism, except for the cost of shuffling.
> This change tries to perform GlobalLimit without shuffling data to single
> partition. Instead, we perform the map stage of the shuffling and collect the
> statistics of the number of rows in each partition. Shuffled data are
> actually all retrieved locally without from remote executors.
> Once we get the number of output rows in each partition, we only take the
> required number of rows from the locally shuffled data.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]