[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432910#comment-17432910 ]
Apache Spark commented on SPARK-37099: -------------------------------------- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/34367 > Impl a rank-based filter to optimize top-k computation > ------------------------------------------------------ > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.3.0 > Reporter: zhengruifeng > Priority: Major > Attachments: skewed_window.png > > > in JD, we found that more than 80% usage of window function follows this > pattern: > {code:java} > select (... row_number() over(partition by ... order by ...) as rn) > where rn ==[\<=] k{code} > > However, existing physical plan is not optimum: > > 1, we should select local top-k records within each partitions, and then > compute the global top-k. this can help reduce the shuffle amount; > > 2, skewed-window: some partition is skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org