francis0407 commented on issue #24344: [SPARK-27440][SQL] Optimize uncorrelated predicate subquery URL: https://github.com/apache/spark/pull/24344#issuecomment-483272960 Hi, I'm back with some ideas. I have not found any SQL system that implements the conversion. But IMO, this method still can be used with some restrictions. I think the question we need to pay more attention to is that `EXISTS` is not nullable while `IN` could be `null` when using `NOT IN (subquery with nulls)`. Remember our goal is to avoid collecting all rows of the subquery's result or using a join. Instead, we need to push the outer condition into the subquery (e.g. `a in select b` => `where a = b`). In conclusion, we can have three cases here: 1. For uncorrelated `EXISTS(subquery)`, we rewrite it as `EXISTS(select 1 from ... where ... limit 1)` 2. For uncorrelated `a IN(select b ...)`, we rewrite it as `EXISTS(select 1 from ... where a = b and ... limit 1)` 3. For uncorrelated `a NOT IN(select b ...)`, we can define an aggregate function `agg` (not sure about the name) here. It is initialized with `false` , returns `null` if there are nulls in the rows, and returns `true` if the result set is not empty and has no nulls. Then we can rewrite the subquery as a scalar subquery: `NOT (select agg(b) from ... where (a = b or b is null))`. I'm not sure whether this method is right, corrections are truly welcomed!
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
