[
https://issues.apache.org/jira/browse/SPARK-44448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan reassigned SPARK-44448:
-----------------------------------
Assignee: Jack Chen
> Wrong results for dense_rank() <= k from InferWindowGroupLimit and
> DenseRankLimitIterator
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-44448
> URL: https://issues.apache.org/jira/browse/SPARK-44448
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Jack Chen
> Assignee: Jack Chen
> Priority: Major
>
> Top-k filters on a dense_rank() window function return wrong results, due to
> a bug in optimization InferWindowGroupLimit, specifically in the code for
> DenseRankLimitIterator, introduced in
> https://issues.apache.org/jira/browse/SPARK-37099.
> Repro:
> {code:java}
> create or replace temp view t1 (p, o) as values (1, 1), (1, 1), (1, 2), (2,
> 1), (2, 1), (2, 2);
> select * from (select *, dense_rank() over (partition by p order by o) as rnk
> from t1) where rnk = 1;{code}
> Spark result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]{code}
> Correct result:
> {code:java}
> [1,1,1]
> [1,1,1]
> [2,1,1]
> [2,1,1]{code}
>
> The bug is in {{{}DenseRankLimitIterator{}}}, it fails to reset state
> properly when transitioning from one window partition to the next. {{reset}}
> only resets {{{}rank = 0{}}}, what it is missing is to reset
> {{{}currentRankRow = null{}}}. This means that when processing the second and
> later window partitions, the rank incorrectly gets incremented based on
> comparing the ordering of the last row of the previous partition to the first
> row of the new partition.
> This means that a dense_rank window func that has more than one window
> partition and more than one row with dense_rank = 1 in the second or later
> partitions can give wrong results when optimized.
> ({{{}RankLimitIterator{}}} narrowly avoids this bug by happenstance, the
> first row in the new partition will try to increment rank, but increment it
> by the value of count which is 0, so it happens to work by accident).
> Unfortunately, tests for the optimization only had a single row per rank, so
> did not catch the bug as the bug requires multiple rows per rank.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]