[ 
https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-46446:
------------------------------
    Description: 
Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
introduced recently when support for correlation under OFFSET was enabled but 
were not handled correctly. (So we went from unsupported, query throws error -> 
wrong results.)

It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS

 

It's easy to repro with a query like
{code:java}
create table x(x1 int, x2 int);
insert into x values (1, 1), (2, 2);
create table y(y1 int, y2 int);
insert into y values (1, 1), (1, 2), (2, 4);


select * from x where exists (select * from y where x1 = y1 limit 1 offset 
2){code}
Correct result: empty set, see postgres: 
[https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] 

Spark result: Array([2,2])

 

The 
[PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
 where it was introduced added a test for it, but the golden file results for 
the test actually were incorrect and we didn't notice.

I'll work on both:
 * Adding support for offset in DecorrelateInnerQuery (the transformation is 
into a filter on row_number window function, similar to limit).

 * Adding a feature flag to enable/disable offset in subquery support

  was:
Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
introduced recently when support for correlation under OFFSET was enabled but 
were not handled correctly. (So we went from unsupported, query throws error -> 
wrong results.)

It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS

It's easy to repro with a query like
{code:java}
SELECT * 
FROM   emp 
join lateral   (SELECT dept.dept_name
               FROM   dept 
               WHERE  emp.dept_id = dept.dept_id
               LIMIT 5 OFFSET 3); {code}
The 
[PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
 where it was introduced added a test for it, but the golden file results for 
the test actually were incorrect and we didn't notice.

I'll work on both:
 * Adding support for offset in DecorrelateInnerQuery (the transformation is 
into a filter on row_number window function, similar to limit).

 * Adding a feature flag to enable/disable offset in subquery support


> Correctness bug in correlated subquery with OFFSET
> --------------------------------------------------
>
>                 Key: SPARK-46446
>                 URL: https://issues.apache.org/jira/browse/SPARK-46446
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Jack Chen
>            Priority: Major
>
> Subqueries with correlation under LIMIT with OFFSET have a correctness bug, 
> introduced recently when support for correlation under OFFSET was enabled but 
> were not handled correctly. (So we went from unsupported, query throws error 
> -> wrong results.)
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS
>  
> It's easy to repro with a query like
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1), (2, 2);
> create table y(y1 int, y2 int);
> insert into y values (1, 1), (1, 2), (2, 4);
> select * from x where exists (select * from y where x1 = y1 limit 1 offset 
> 2){code}
> Correct result: empty set, see postgres: 
> [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] 
> Spark result: Array([2,2])
>  
> The 
> [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403]
>  where it was introduced added a test for it, but the golden file results for 
> the test actually were incorrect and we didn't notice.
> I'll work on both:
>  * Adding support for offset in DecorrelateInnerQuery (the transformation is 
> into a filter on row_number window function, similar to limit).
>  * Adding a feature flag to enable/disable offset in subquery support



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to