beliefer opened a new pull request, #40142:
URL: https://github.com/apache/spark/pull/40142
### What changes were proposed in this pull request?
Sometimes, the SQL exists filter which condition compares the window
function `ROW_NUMBER()` with number. For example,
```
SELECT *,
ROW_NUMBER() OVER(ORDER BY a) AS rn
FROM Tab1
WHERE rn <= 5
```
We can create a `Limit` as the parent node of `Window` node. Such as:
`Limit(Literal(5), Window)` and the optimizer rule `LimitPushDownThroughWindow`
will push down the `Limit` as the child of `Window`. After this optimization,
Spark executes top n and reduce the data size before shuffle. We can consider
the SQL adjusted as the below.
```
SELECT *,
ROW_NUMBER() OVER(ORDER BY a) AS rn
FROM
(SELECT *
FROM Tab1
ORDER BY a LIMIT 5) t
```
In short, it supports following pattern:
```
SELECT (... row_number()
OVER (
ORDER BY ... ) AS rn)
WHERE rn (==|<|<=) k
AND other conditions
```
### Why are the changes needed?
Improve the performance by infer window limit and push down it through
window when partitionSpec is empty.
### Does this PR introduce _any_ user-facing change?
'No'.
New feature.
### How was this patch tested?
The top n (Limit + Sort) have better performance than WindowGroupLimit if
the window function is RowNumber and Window partitionSpec is empty.
The new micro benchmark
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Benchmark Top-K: Best
Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-----------------------------------------------------------------------------------------------------------------------------------------------
ROW_NUMBER (PARTITION: , WindowGroupLimit: false)
11054 11684 406 1.9 527.1 1.0X
ROW_NUMBER (PARTITION: , WindowGroupLimit: true)
1737 1772 23 12.1 82.8 6.4X
ROW_NUMBER (PARTITION: PARTITION BY b, WindowGroupLimit: false)
24570 24698 97 0.9 1171.6 0.4X
ROW_NUMBER (PARTITION: PARTITION BY b, WindowGroupLimit: true)
6645 6916 219 3.2 316.9 1.7X
RANK (PARTITION: , WindowGroupLimit: false)
11590 11917 237 1.8 552.7 1.0X
RANK (PARTITION: , WindowGroupLimit: true)
2697 2865 107 7.8 128.6 4.1X
RANK (PARTITION: PARTITION BY b, WindowGroupLimit: false)
25357 25476 94 0.8 1209.1 0.4X
RANK (PARTITION: PARTITION BY b, WindowGroupLimit: true)
6625 6755 92 3.2 315.9 1.7X
DENSE_RANK (PARTITION: , WindowGroupLimit: false)
11876 12137 184 1.8 566.3 0.9X
DENSE_RANK (PARTITION: , WindowGroupLimit: true)
2678 2961 147 7.8 127.7 4.1X
DENSE_RANK (PARTITION: PARTITION BY b, WindowGroupLimit: false)
25407 27601 700 0.8 1211.5 0.4X
DENSE_RANK (PARTITION: PARTITION BY b, WindowGroupLimit: true)
6841 7008 184 3.1 326.2 1.6X
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]