[GitHub] [spark] allisonwang-db opened a new pull request #32179: [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when a subquery is aggregated

GitBox Wed, 14 Apr 2021 17:31:32 -0700


allisonwang-db opened a new pull request #32179:
URL: https://github.com/apache/spark/pull/32179



   ### What changes were proposed in this pull request?
   This PR updates the `foundNonEqualCorrelatedPred` logic for correlated 
subqueries in `CheckAnalysis` to allow only correlated equality predicates that 
guarantee a one-to-one mapping between inner and outer attributes, instead of all 
equality predicates. 
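
A minimal sketch of the kind of check this describes, not the actual `CheckAnalysis` code. The toy expression classes `Inner`, `Outer`, and `Func` and the helper `is_one_to_one_equality` are hypothetical names introduced here for illustration. The sketch assumes an equality is safe only when a bare inner attribute is compared directly with a bare outer attribute; the real rule in the PR may accept a different set of predicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inner:
    """Hypothetical: an attribute produced inside the subquery."""
    name: str


@dataclass(frozen=True)
class Outer:
    """Hypothetical: a reference to an attribute of the outer query."""
    name: str


@dataclass(frozen=True)
class Func:
    """Hypothetical: any expression computed from its arguments."""
    name: str
    args: tuple


def is_one_to_one_equality(left, right):
    """Return True when `left = right` compares a bare inner attribute
    with a bare outer attribute (assumed definition of one-to-one)."""
    sides = (left, right)
    has_inner = any(isinstance(s, Inner) for s in sides)
    has_outer = any(isinstance(s, Outer) for s in sides)
    return has_inner and has_outer


# t1.c = t2.c: each outer value maps to exactly one inner value
plain = is_one_to_one_equality(Outer("t1.c"), Inner("t2.c"))
# t1.c = substring(t2.c, 1, 1): many inner values map to one outer value
wrapped = is_one_to_one_equality(
    Outer("t1.c"), Func("substring", (Inner("t2.c"), 1, 1))
)
```

Under this assumption, `plain` is accepted and `wrapped` is rejected, which lines up with the two failing examples in the next section.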
   
   ### Why are the changes needed?
   To fix correctness bugs. Before this fix, Spark could return wrong results for 
certain correlated subqueries that pass `CheckAnalysis`:
   Example 1:
   ```sql
   create or replace view t1(c) as values ('a'), ('b');
   create or replace view t2(c) as values ('ab'), ('abc'), ('bc');
   
   select c, (select count(*) from t2 where t1.c = substring(t2.c, 1, 1)) from t1;
   ```
   Correct results: [(a, 2), (b, 1)]
   Spark results:
   ```
   +---+-----------------+
   |c  |scalarsubquery(c)|
   +---+-----------------+
   |a  |1                |
   |a  |1                |
   |b  |1                |
   +---+-----------------+
   ```
   Example 2:
   ```sql
   create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3);
   create or replace view t2(c) as values (6);
   
   select c, (select count(*) from t1 where a + b = c) from t2;
   ```
   Correct results: [(6, 4)]
   Spark results:
   ```
   +---+-----------------+
   |c  |scalarsubquery(c)|
   +---+-----------------+
   |6  |1                |
   |6  |1                |
   |6  |1                |
   |6  |1                |
   +---+-----------------+
   ```
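
One way to read the wrong results above is sketched below. It is a simplified model, not Spark's implementation: it assumes the correlated predicate is pulled out of the subquery, the aggregate is computed per distinct combination of the inner columns the predicate uses, and the groups are then joined back to the outer rows. The function names `per_row_count` and `pulled_up_count` are hypothetical, and an inner join is used for brevity.

```python
from collections import Counter


def per_row_count(outer_rows, inner_rows, matches):
    """Reference semantics: evaluate count(*) once for each outer row."""
    return [(o, sum(1 for i in inner_rows if matches(o, i))) for o in outer_rows]


def pulled_up_count(outer_rows, inner_rows, matches):
    """Assumed rewrite: group the inner rows first, then join on the predicate."""
    groups = Counter(inner_rows)  # count(*) per distinct inner row
    return [(o, n) for o in outer_rows for g, n in groups.items() if matches(o, g)]


# Example 1: t1.c = substring(t2.c, 1, 1)
ex1_outer = ["a", "b"]
ex1_inner = ["ab", "abc", "bc"]
ex1_pred = lambda o, i: o == i[:1]

# Example 2: a + b = c
ex2_outer = [6]
ex2_inner = [(0, 6), (1, 5), (2, 4), (3, 3)]
ex2_pred = lambda o, i: o == i[0] + i[1]

print(per_row_count(ex1_outer, ex1_inner, ex1_pred))    # [('a', 2), ('b', 1)]
print(pulled_up_count(ex1_outer, ex1_inner, ex1_pred))  # [('a', 1), ('a', 1), ('b', 1)]
print(per_row_count(ex2_outer, ex2_inner, ex2_pred))    # [(6, 4)]
print(pulled_up_count(ex2_outer, ex2_inner, ex2_pred))  # [(6, 1), (6, 1), (6, 1), (6, 1)]
```

In both examples several inner groups satisfy the predicate for the same outer value, so the join emits one row per group instead of one aggregated row. With a predicate such as `t1.c = t2.c`, at most one group can match each outer value and the two functions agree.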
   ### Does this PR introduce _any_ user-facing change?
   Yes. Queries that contain unsupported correlated equality predicates now 
fail analysis instead of returning wrong results.
   
   ### How was this patch tested?
   Added unit tests.
   

