[
https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827619#comment-15827619
]
Hyukjin Kwon commented on SPARK-19222:
--------------------------------------
Can you make the indentation pretty if you want to remove \{code\} ... \{code\}?
> Limit Query Performance issue
> -----------------------------
>
> Key: SPARK-19222
> URL: https://issues.apache.org/jira/browse/SPARK-19222
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Environment: Linux/Windows
> Reporter: Sujith
> Priority: Minor
>
> When limit is being added in the middle of the physical plan there will
> be possibility of memory bottleneck
> if the limit value is too large and system will try to aggregate all the
> partition limit values as part of single partition.
> Description:
> Eg:
> create table src_temp as select * from src limit n; (n=10000000)
> == Physical Plan ==
> ExecutedCommand
> +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2,
> InsertIntoHiveTable]
> +- GlobalLimit 10000000
> +- LocalLimit 10000000
> +- Project [imei#101, age#102, task#103L, num#104, level#105,
> productdate#106, name#107, point#108]
> +- SubqueryAlias hive
> +-
> Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
> csv |
> As shown in above plan when the limit comes in middle,there can be two
> types of performance bottlenecks.
> scenario 1: when the partition count is very high and limit value is small
> scenario 2: when the limit value is very large
> Eg,current scenario based on following sample data of limit count is 10000000
> and partition count 5
> Local Limit -------- > |partition 1||partition 2||partition 3||partition
> 4||partition 5|
> ----------------------> <<take n>><<take n>><<take n>><<take n>><<take n>>
> |
> Shuffle Exchange(into single
> partition)
> |
> Global Limit -------- > << take n>> (all the
> partition data will be grouped in single partition)
>
>
> as the above scenario occurs where system will shuffle and try to group the
> limit data from all partition
> to single partition which will induce performance bottleneck.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]