[jira] [Commented] (SPARK-19355) Use map output statistices to improve global limit's parallelism

Liang-Chi Hsieh (JIRA) Thu, 20 Sep 2018 05:53:46 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621965#comment-16621965
 ]


Liang-Chi Hsieh commented on SPARK-19355:
-----------------------------------------

[~cloud_fan] For this, I think we should first have the API we discussed to 
retrieve data statistics back to driver for an RDD. I will create another 
ticket for that.

> Use map output statistices to improve global limit's parallelism
> ----------------------------------------------------------------
>
>                 Key: SPARK-19355
>                 URL: https://issues.apache.org/jira/browse/SPARK-19355
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>             Fix For: 2.4.0
>
>
> A logical Limit is performed actually by two physical operations LocalLimit 
> and GlobalLimit.
> In most of time, before GlobalLimit, we will perform a shuffle exchange to 
> shuffle data to single partition. When the limit number is very big, we 
> shuffle a lot of data to a single partition and significantly reduce 
> parallelism, except for the cost of shuffling.
> This change tries to perform GlobalLimit without shuffling data to single 
> partition. Instead, we perform the map stage of the shuffling and collect the 
> statistics of the number of rows in each partition. Shuffled data are 
> actually all retrieved locally without from remote executors.
> Once we get the number of output rows in each partition, we only take the 
> required number of rows from the locally shuffled data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-19355) Use map output statistices to improve global limit's parallelism

Reply via email to