[ https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621965#comment-16621965 ]
Liang-Chi Hsieh commented on SPARK-19355:
-----------------------------------------

[~cloud_fan] For this, I think we should first have the API we discussed for retrieving an RDD's data statistics back to the driver. I will create another ticket for that.

> Use map output statistics to improve global limit's parallelism
> ---------------------------------------------------------------
>
>                 Key: SPARK-19355
>                 URL: https://issues.apache.org/jira/browse/SPARK-19355
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>             Fix For: 2.4.0
>
>
> A logical Limit is actually performed by two physical operations, LocalLimit
> and GlobalLimit.
> Most of the time, before GlobalLimit, we perform a shuffle exchange to
> move the data to a single partition. When the limit is very large, we
> shuffle a lot of data to a single partition and significantly reduce
> parallelism, in addition to paying the cost of shuffling.
> This change tries to perform GlobalLimit without shuffling data to a single
> partition. Instead, we perform the map stage of the shuffle and collect
> statistics on the number of rows in each partition. The shuffled data are
> all retrieved locally rather than from remote executors.
> Once we have the number of output rows in each partition, we take only the
> required number of rows from the locally shuffled data.
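
The last step of the description (use the per-partition row counts to decide how many rows each partition keeps) can be illustrated with a small sketch. This is plain Python over in-memory lists, not Spark code and not the implementation in the patch; the names `rows_to_take` and `global_limit` are hypothetical, and filling partitions greedily in order is an assumption made for illustration only.

```python
from typing import List, Sequence


def rows_to_take(partition_row_counts: Sequence[int], limit: int) -> List[int]:
    """Decide how many rows each partition keeps, given only its row count.

    Partitions are filled greedily in order (an assumed policy) until the
    limit is reached; later partitions keep nothing.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    remaining = limit
    takes = []
    for count in partition_row_counts:
        take = min(count, remaining)
        takes.append(take)
        remaining -= take
    return takes


def global_limit(partitions: Sequence[list], limit: int) -> List[list]:
    """Apply the limit partition by partition, without merging the data."""
    # Stand-in for the row-count statistics collected after the map stage.
    counts = [len(p) for p in partitions]
    takes = rows_to_take(counts, limit)
    # Each partition truncates its own rows; the partition count is unchanged.
    return [p[:n] for p, n in zip(partitions, takes)]
```

For example, with partitions holding 3, 5, and 2 rows and a limit of 6, the sketch keeps 3 rows from the first partition, 3 from the second, and none from the third, so no rows have to be moved to a single partition.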