[ 
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zikun updated SPARK-32096:
--------------------------
    Description: 
Spark SQL rank window function needs to sort the data in each window partition, 
and it relies on the execution operator[ 
|https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsqlhelsinki.visualstudio.com%2Foss%2F_git%2Fspark%3Fpath%3D%252Fsql%252Fcore%252Fsrc%252Fmain%252Fscala%252Forg%252Fapache%252Fspark%252Fsql%252Fexecution%252FSortExec.scala%26version%3DGBsql-2.4%26line%3D37%26lineEnd%3D38%26lineStartColumn%3D1%26lineEndColumn%3D1%26lineStyle%3Dplain&data=02%7C01%7Czixu%40microsoft.com%7Cdc51f9940fc64981c8bd08d7f05ef7c0%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637242163078452885&sdata=HGPm4TbMeJLp9wS0YZmIyqyE4%2BS4Ylw7lebFztX8PWc%3D&reserved=0]
 [*_SortExec_* 
|https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala&version=GBsql-2.4&line=37&lineEnd=43&lineStartColumn=1&lineEndColumn=1&lineStyle=plain]to
 do the sort. During sorting, the window partition key is also put at the front 
of the sort order and thus it brings unnecessary comparisons on the partition 
key. Instead, we can group the rows by partition key first, and inside each 
group we sort the rows without comparing the partition key.​ 

 

In Spark SQL, there are two types of sort execution, *_SortExec_* and 
*_TakeOrderedAndProjectExec_* . *_SortExec_* is a general sorting execution and 
it does not support top-N sort. ​*_TakeOrderedAndProjectExec_* is the execution 
for top-N sort in Spark. Spark SQL rank window function needs to sort the data 
locally and it relies on the execution plan *_SortExec_* to sort the data in 
each physical data partition. When the filter of the window rank (e.g. rank <= 
100) is specified in a user's query, the filter can actually be pushed down to 
the SortExec and then we let SortExec operates top-N sort. Right now SortExec 
does not support top-N sort and we need to extend the capability of SortExec to 
support top-N sort. 

  was:
In Spark SQL, there are two types of sort execution, *_SortExec_* and 
*_TakeOrderedAndProjectExec_* . 

*_SortExec_* is a general sorting execution and it does not support top-N sort. 
​

*_TakeOrderedAndProjectExec_* is the execution for top-N sort in Spark. 

Spark SQL rank window function needs to sort the data locally and it relies on 
the execution plan *_SortExec_* to sort the data in each physical data 
partition. When the filter of the window rank (e.g. rank <= 100) is specified 
in a user's query, the filter can actually be pushed down to the SortExec and 
then we let SortExec operates top-N sort. 

Right now SortExec does not support top-N sort and we need to extend the 
capability of SortExec to support top-N sort. 

Or if SortExec is not considered as the right execution choice, we can create a 
new execution plan called topNSortExec to do top-N sort in each local partition 
if a filter on the window rank is specified. 


> Improve sorting performance for Spark SQL rank window function​
> ---------------------------------------------------------------
>
>                 Key: SPARK-32096
>                 URL: https://issues.apache.org/jira/browse/SPARK-32096
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: Any environment that supports Spark.
>            Reporter: Zikun
>            Priority: Major
>         Attachments: windowSortPerf (1).docx
>
>
> Spark SQL rank window function needs to sort the data in each window 
> partition, and it relies on the execution operator[ 
> |https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsqlhelsinki.visualstudio.com%2Foss%2F_git%2Fspark%3Fpath%3D%252Fsql%252Fcore%252Fsrc%252Fmain%252Fscala%252Forg%252Fapache%252Fspark%252Fsql%252Fexecution%252FSortExec.scala%26version%3DGBsql-2.4%26line%3D37%26lineEnd%3D38%26lineStartColumn%3D1%26lineEndColumn%3D1%26lineStyle%3Dplain&data=02%7C01%7Czixu%40microsoft.com%7Cdc51f9940fc64981c8bd08d7f05ef7c0%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637242163078452885&sdata=HGPm4TbMeJLp9wS0YZmIyqyE4%2BS4Ylw7lebFztX8PWc%3D&reserved=0]
>  [*_SortExec_* 
> |https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala&version=GBsql-2.4&line=37&lineEnd=43&lineStartColumn=1&lineEndColumn=1&lineStyle=plain]to
>  do the sort. During sorting, the window partition key is also put at the 
> front of the sort order and thus it brings unnecessary comparisons on the 
> partition key. Instead, we can group the rows by partition key first, and 
> inside each group we sort the rows without comparing the partition key.​ 
>  
> In Spark SQL, there are two types of sort execution, *_SortExec_* and 
> *_TakeOrderedAndProjectExec_* . *_SortExec_* is a general sorting execution 
> and it does not support top-N sort. ​*_TakeOrderedAndProjectExec_* is the 
> execution for top-N sort in Spark. Spark SQL rank window function needs to 
> sort the data locally and it relies on the execution plan *_SortExec_* to 
> sort the data in each physical data partition. When the filter of the window 
> rank (e.g. rank <= 100) is specified in a user's query, the filter can 
> actually be pushed down to the SortExec and then we let SortExec operates 
> top-N sort. Right now SortExec does not support top-N sort and we need to 
> extend the capability of SortExec to support top-N sort. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to