Guibin Zhang created SPARK-36770: ------------------------------------ Summary: Improve run-time performance for window function first and last against UnboundedFollowing window frame Key: SPARK-36770 URL: https://issues.apache.org/jira/browse/SPARK-36770 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2, 2.4.0 Reporter: Guibin Zhang
*Context* The *UnboundedFollowingWindowFunctionFrame* has the time complexity of *O(N^2)*, N is the number of rows in the current partition, more specific the complexity is O(N* (N - 1)/2). What happens internally in UnboundedFollowingWindowFunctionFrame: In the window frame, while processing each incoming row, it will go through current row till the end of partition to do re-calculation. This process will be repeated on each incoming row, which causes the high run-time complexity. But UnboundedPrecedingWindowFunctionFrame has much better time complexity O(N), N is the number of rows in the current partition. *What is the idea of the improvement?* Give the big time complexity difference between UnboundedFollowingWindowFunctionFrame and UnboundedPrecedingWindowFunctionFrame, we can do following conversions to improve the time complexity of first() and last() from O(N^2) to O(N) {code:java} case 1: first() OVER(PARTITION BY colA ORDER BY colB ASC ROWS UNBOUNDED FOLLOWING) converts to last() OVER(PARTITION BY colA ORDER BY colB DEAC ROWS UNBOUNDED PRECEDING) case 2: last() OVER(PARTITION BY colA ORDER BY colB ASC ROWS UNBOUNDED FOLLOWING) converts to first() OVER(PARTITION BY colA ORDER BY colB DESC ROWS UNBOUNDED PRECEDING) {code} *Summary* Replace "UNBOUNDED FOLLOWING" with "UNBOUNDED PRECEDING", and flip the ORDER BY for the window functions first() and last(). -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org