francis0407 commented on a change in pull request #24344: [SPARK-27440][SQL]
Optimize uncorrelated predicate subquery
URL: https://github.com/apache/spark/pull/24344#discussion_r278608147
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
##########
@@ -551,3 +552,47 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] {
}
}
}
+
+/**
+ * This rule rewrites uncorrelated PredicateSubquery expressions such as Exists, InSubquery.
+ * The uncorrelated Exists and InSubquery can be evaluated using a subplan instead of a semi-join.
+ *
+ * For uncorrelated Exists, we can use `limit 1` and `select 1` after the Exists subquery to
+ * reduce the result set.
+ * {{{
+ * SELECT * FROM s WHERE EXISTS(SELECT b FROM t WHERE t.a = 2);
+ * ==> SELECT * FROM s WHERE EXISTS(SELECT 1 FROM t WHERE t.a = 2 LIMIT 1);
+ * }}}
+ *
+ * For uncorrelated InSubquery, we can push the left values into the subquery to reduce the result
+ * set. Note that InSubquery may be nullable, so we can not eliminate nulls for both sides.
+ * {{{
+ * SELECT * FROM s WHERE 3 IN (SELECT b FROM t WHERE a = 2);
+ * ==> SELECT * FROM s WHERE 3 IN (SELECT b FROM t WHERE a = 2 AND (b = 3 or b IS NULL));
Review comment:
Yes, I think so.
My main goal is to ensure that the `IN` expression can be evaluated before the
final execution.
If the left values contain an attribute, for example:
```sql
select *
from t
where t.a in (select b from s)
```
First, we cannot push `t.a` into the subquery. We could still collect the
subquery result before the final execution, rewrite the expression as an
`InSet`, and broadcast the subquery's result set to each node. But that is
essentially equivalent to using a semi-join. Therefore, I think there is no
need to rewrite an `InSubquery` with non-literal left values.
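
To make the equivalence concrete, here is a minimal sketch in plain Scala. It does not use any Spark API: `Option[Int]` stands in for a nullable column, and the object and helper names (`InSetVsSemiJoin`, `sqlIn`, `filterWithCollectedSet`, `leftSemiJoin`) are hypothetical, chosen only for this illustration. It compares probing a collected result set under three-valued `IN` semantics with a left semi-join on `t.a = s.b`.

```scala
// Hypothetical model: Seq[Option[Int]] represents a single nullable column,
// None represents SQL NULL. This is an illustration, not Spark code.
object InSetVsSemiJoin {

  // SQL three-valued IN: Some(true), Some(false), or None for NULL (unknown).
  def sqlIn(left: Option[Int], values: Seq[Option[Int]]): Option[Boolean] =
    left match {
      // NULL IN (empty set) is false; NULL IN (non-empty set) is unknown.
      case None => if (values.isEmpty) Some(false) else None
      case Some(v) =>
        if (values.contains(Some(v))) Some(true)   // a definite match
        else if (values.contains(None)) None       // no match, but a NULL is present
        else Some(false)                           // no match and no NULLs
    }

  // InSet-like evaluation: the subquery result `s` is collected once and
  // probed per row. WHERE keeps a row only when the predicate is true.
  def filterWithCollectedSet(t: Seq[Option[Int]], s: Seq[Option[Int]]): Seq[Option[Int]] =
    t.filter(a => sqlIn(a, s).contains(true))

  // Left semi-join on t.a = s.b: keep a row of t if some row of s matches.
  // NULL never compares equal, so NULL keys on either side never match.
  def leftSemiJoin(t: Seq[Option[Int]], s: Seq[Option[Int]]): Seq[Option[Int]] =
    t.filter(a => a.isDefined && s.exists(b => b == a))

  def main(args: Array[String]): Unit = {
    val t = Seq(Some(1), Some(2), None, Some(4)) // column t.a
    val s = Seq(Some(2), None, Some(4), Some(4)) // column s.b
    println(filterWithCollectedSet(t, s))
    println(leftSemiJoin(t, s))
  }
}
```

In a `WHERE` clause, rows for which `IN` evaluates to NULL are dropped just like rows for which it is false, so both functions keep the same rows. The collected-set form only changes how the subquery result reaches each node, which is why it gains nothing over a semi-join when the left value is an attribute.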