[
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zikun updated SPARK-32096:
--------------------------
Description:
The Spark SQL rank window function needs to sort the data in each window
partition, and it relies on the execution operator
[*_SortExec_*|https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala&version=GBsql-2.4&line=37&lineEnd=43&lineStartColumn=1&lineEndColumn=1&lineStyle=plain]
to do the sort. During sorting, the window partition key is also placed at the
front of the sort order, which adds unnecessary comparisons on the partition
key. Instead, we can group the rows by partition key first and then sort the
rows inside each group without comparing the partition key.
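
The difference between the two strategies can be sketched outside Spark with plain Python lists. This is only an illustration of the idea, not SortExec's implementation: the row layout `(partition_key, order_key)`, the function names, and the use of a hash map for grouping are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows: (partition_key, order_key). Illustrative data only.
rows = [("b", 3), ("a", 2), ("b", 1), ("a", 5), ("a", 1)]


def sort_with_partition_key(rows):
    """Approach described as current: a single sort whose sort order starts
    with the partition key, so every comparison may touch that key."""
    return sorted(rows, key=lambda r: (r[0], r[1]))


def group_then_sort(rows):
    """Approach described as proposed: bucket rows by partition key first
    (hashing here is an assumption), then sort each bucket on the order key
    alone, so the sort never compares the partition key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return {key: sorted(group, key=lambda r: r[1]) for key, group in groups.items()}


grouped = group_then_sort(rows)
for key in sorted(grouped):
    print(key, grouped[key])
```

Both functions yield the same order of rows inside each window partition; the second simply keeps the partition key out of the comparison function.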

In Spark SQL, there are two types of sort execution, *_SortExec_* and
*_TakeOrderedAndProjectExec_*. *_SortExec_* is a general sorting operator and
does not support top-N sort. *_TakeOrderedAndProjectExec_* is the operator
for top-N sort in Spark. The Spark SQL rank window function needs to sort the
data locally, and it relies on the execution plan *_SortExec_* to sort the data
in each physical data partition. When a filter on the window rank (e.g. rank <=
100) is specified in a user's query, the filter can be pushed down to SortExec
so that SortExec performs a top-N sort. Right now SortExec does not support
top-N sort, so we need to extend SortExec to support it.
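
Why a rank filter allows a top-N sort can be shown with a small standalone sketch. It is not Spark code and does not reflect how SortExec would be extended. The function names are hypothetical, and the values stand for the ordering column of a single window partition. Because rank assigns the same rank to ties, the top-N selection has to keep every value tied with the N-th one.

```python
import heapq
from bisect import bisect_left


def full_sort_then_filter(values, n):
    """Baseline: sort everything, compute rank, then keep rank <= n.
    The rank of v is 1 + the number of values strictly smaller than v."""
    ordered = sorted(values)
    return [v for v in ordered if 1 + bisect_left(ordered, v) <= n]


def top_n_by_rank(values, n):
    """Sketch of the pushed-down filter: select only the n smallest values
    instead of sorting the whole partition, then add back any values tied
    with the n-th one so the result matches rank <= n."""
    if n <= 0 or not values:
        return []
    smallest = heapq.nsmallest(n, values)  # ascending order, at most n items
    cutoff = smallest[-1]
    extra_ties = sum(1 for v in values if v == cutoff) - smallest.count(cutoff)
    return smallest + [cutoff] * extra_ties


partition = [5, 1, 3, 3, 2]
print(top_n_by_rank(partition, 3))
print(full_sort_then_filter(partition, 3))
```

Both calls return the same rows; the top-N variant avoids fully ordering rows that the rank filter would discard anyway.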
was:
In Spark SQL, there are two types of sort execution, *_SortExec_* and
*_TakeOrderedAndProjectExec_* .
*_SortExec_* is a general sorting execution and it does not support top-N sort.
*_TakeOrderedAndProjectExec_* is the execution for top-N sort in Spark.
Spark SQL rank window function needs to sort the data locally and it relies on
the execution plan *_SortExec_* to sort the data in each physical data
partition. When the filter of the window rank (e.g. rank <= 100) is specified
in a user's query, the filter can actually be pushed down to the SortExec and
then we let SortExec operate a top-N sort.
Right now SortExec does not support top-N sort and we need to extend the
capability of SortExec to support top-N sort.
Or if SortExec is not considered as the right execution choice, we can create a
new execution plan called topNSortExec to do top-N sort in each local partition
if a filter on the window rank is specified.
> Improve sorting performance for Spark SQL rank window function
> ---------------------------------------------------------------
>
> Key: SPARK-32096
> URL: https://issues.apache.org/jira/browse/SPARK-32096
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Environment: Any environment that supports Spark.
> Reporter: Zikun
> Priority: Major
> Attachments: windowSortPerf (1).docx