bersprockets opened a new pull request, #36951:
URL: https://github.com/apache/spark/pull/36951
### What changes were proposed in this pull request?
Change `LimitPushDownThroughWindow` so that it no longer supports pushing
down a limit through a window using percent_rank.
### Why are the changes needed?
Given a query with a limit of _n_ rows, and a window whose child produces
_m_ rows (with _m_ > _n_), pushing the limit below the window means
percent_rank sees only _n_ rows. It therefore labels the _nth_ row as 100%
rather than the _mth_ row.
This behavior conflicts with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268.
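The effect of the ordering can be illustrated outside Spark. The following is a minimal sketch in plain Python, not Spark code: the helper `percent_rank` is a hypothetical stand-in that assumes the usual definition `(rank - 1) / (rows - 1)` over a single unpartitioned window ordered by the value, and it compares applying the limit before versus after the window function.

```python
def percent_rank(values):
    """Hypothetical helper: percent_rank() over (order by value).

    Uses (rank - 1) / (rows - 1), with 0.0 when there is only one row.
    Ties share the rank of their first occurrence.
    """
    ordered = sorted(values)
    total = len(ordered)
    result = []
    rank = 0
    for i, v in enumerate(ordered):
        # Start a new rank only when the value changes.
        if i == 0 or v != ordered[i - 1]:
            rank = i + 1
        pr = 0.0 if total == 1 else (rank - 1) / (total - 1)
        result.append((v, pr))
    return result


ids = list(range(101))  # same data as range(101)
n = 3                   # the query's limit

# Limit pushed below the window: percent_rank sees only n rows.
limit_first = percent_rank(sorted(ids)[:n])

# Limit applied after the window: percent_rank sees all m rows.
limit_last = percent_rank(ids)[:n]

print(limit_first)
print(limit_last)
```

Under these assumptions, `limit_first` ends at 1.0 on the third row, while `limit_last` keeps the values computed against all 101 rows.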
#### Example
Assume this data:
```
create table t1 stored as parquet as
select *
from range(101);
```
And also assume this query:
```
select id, percent_rank() over (order by id) as pr
from t1
limit 3;
```
With Spark 3.2.1, 3.3.0, and master, the limit is applied before the
percent_rank:
```
0 0.0
1 0.5
2 1.0
```
With Spark 3.1.3, Hive 2.3.9, and Prestodb 0.268, the limit is applied
_after_ percent_rank:
Spark 3.1.3:
```
0 0.0
1 0.01
2 0.02
```
Hive 2.3.9:
```
0: jdbc:hive2://localhost:10000> select id, percent_rank() over (order by id) as pr
from t1
limit 3;
. . . . . . . . . . . . . . . .> . . . . . . . . . . . . . . . .> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-----+-------+
| id | pr |
+-----+-------+
| 0 | 0.0 |
| 1 | 0.01 |
| 2 | 0.02 |
+-----+-------+
3 rows selected (4.621 seconds)
0: jdbc:hive2://localhost:10000>
```
Prestodb 0.268:
```
id | pr
----+------
0 | 0.0
1 | 0.01
2 | 0.02
(3 rows)
```
With this PR, Spark will apply the limit after percent_rank.
### Does this PR introduce _any_ user-facing change?
No (besides changing percent_rank's behavior to be more like Spark 3.1.3,
Hive, and Prestodb).
### How was this patch tested?
New unit tests.