[ 
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zikun updated SPARK-32096:
--------------------------
    Attachment: windowSortPerf (1).docx

> Improve sorting performance for Spark SQL rank window function​
> ---------------------------------------------------------------
>
>                 Key: SPARK-32096
>                 URL: https://issues.apache.org/jira/browse/SPARK-32096
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: Any environment that supports Spark.
>            Reporter: Zikun
>            Priority: Major
>         Attachments: windowSortPerf (1).docx
>
>
> Spark SQL rank window function needs to sort the data in each window 
> partition, and it relies on the execution operator[ 
> |https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsqlhelsinki.visualstudio.com%2Foss%2F_git%2Fspark%3Fpath%3D%252Fsql%252Fcore%252Fsrc%252Fmain%252Fscala%252Forg%252Fapache%252Fspark%252Fsql%252Fexecution%252FSortExec.scala%26version%3DGBsql-2.4%26line%3D37%26lineEnd%3D38%26lineStartColumn%3D1%26lineEndColumn%3D1%26lineStyle%3Dplain&data=02%7C01%7Czixu%40microsoft.com%7Cdc51f9940fc64981c8bd08d7f05ef7c0%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637242163078452885&sdata=HGPm4TbMeJLp9wS0YZmIyqyE4%2BS4Ylw7lebFztX8PWc%3D&reserved=0]
>  [*_SortExec_* 
> |https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala&version=GBsql-2.4&line=37&lineEnd=43&lineStartColumn=1&lineEndColumn=1&lineStyle=plain]to
>  do the sort. During sorting, the window partition key is also put at the 
> front of the sort order and thus it brings unnecessary comparisons on the 
> partition key. Instead, we can group the rows by partition key first, and 
> inside each group we sort the rows without comparing the partition key.​ 
>  
> In Spark SQL, there are two types of sort execution, *_SortExec_* and 
> *_TakeOrderedAndProjectExec_* . *_SortExec_* is a general sorting execution 
> and it does not support top-N sort. ​*_TakeOrderedAndProjectExec_* is the 
> execution for top-N sort in Spark. Spark SQL rank window function needs to 
> sort the data locally and it relies on the execution plan *_SortExec_* to 
> sort the data in each physical data partition. When the filter of the window 
> rank (e.g. rank <= 100) is specified in a user's query, the filter can 
> actually be pushed down to the SortExec and then we let SortExec operates 
> top-N sort. Right now SortExec does not support top-N sort and we need to 
> extend the capability of SortExec to support top-N sort. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to