francis0407 commented on a change in pull request #24344: [SPARK-27440][SQL] Optimize uncorrelated predicate subquery
URL: https://github.com/apache/spark/pull/24344#discussion_r278601803
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 ##########
 @@ -551,3 +552,47 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] {
       }
   }
 }
+
+/**
+ * This rule rewrites uncorrelated PredicateSubquery expressions such as Exists, InSubquery.
+ * The uncorrelated Exists and InSubquery can be evaluated using a subplan instead of a semi-join.
+ *
+ * For uncorrelated Exists, we can use `limit 1` and `select 1` after the Exists subquery to
+ * reduce the result set.
+ * {{{
+ *  SELECT * FROM s WHERE EXISTS(SELECT b FROM t WHERE t.a = 2);
+ *  ==> SELECT * FROM s WHERE EXISTS(SELECT 1 FROM t WHERE t.a = 2 LIMIT 1);
+ * }}}
+ *
+ * For uncorrelated InSubquery, we can push the left values into the subquery to reduce the result
+ * set. Note that InSubquery may be nullable, so we can not eliminate nulls for both sides.
+ * {{{
+ *  SELECT * FROM s WHERE 3 IN (SELECT b FROM t WHERE a = 2);
+ *  ==> SELECT * FROM s WHERE 3 IN (SELECT b FROM t WHERE a = 2 AND (b = 3 or b IS NULL));
+ * }}}
+ */
+object RewriteUncorrelatedSubquery extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Filter(condition, child) =>
+      val newCondition = condition transform {
+        case Exists(sub, children, _) if children.isEmpty =>
+          // Wrap the subquery with `limit 1` and `project 1`
+          val newPlan =
+            Project(Seq(Alias(Literal.create(1, IntegerType), "1")()),
+              Limit(Literal.create(1, IntegerType), sub))
+          Exists(newPlan)
+        case InSubquery(values, ListQuery(sub, children, _, childOutputs))
+          if values.forall(_.foldable) && children.isEmpty =>
+          // Push the outer values into the subquery
+          val inCondition = values.zip(sub.output).map {
+            case (outer, inner) =>
+              Or(EqualTo(outer, inner), IsNull(EqualTo(outer, inner)))
 
 Review comment:
   Oh, I forgot to update the comment.
   If we use `IsNull(inner)`, we cannot handle `null in (subquery)` correctly.
   Example:
   We have a table **t1**:
   ```
   +---+----+
   |t1a| t1b|
   +---+----+
   |  1|   1|
   |  2|   2|
   +---+----+
   ```
   and a query:
   ```sql
   select *
   from t
   where null in (select t1a from t1);
   ```
   The subquery expression should be evaluated as `null in (1, 2)`, which 
returns `null`.
   
   If we rewrite it using `IsNull(inner)`:
   ```sql
   select *
   from  t
   where null in (select t1a from t1 where (t1a=null or t1a is null))
   ```
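   The added filter evaluates to `null` for every row of `t1` (`t1a = null` is `null` and `t1a is null` is `false`), so the subquery returns no rows. The expression is then evaluated as `null in (an empty set)`, which returns `false`.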
   
   If we use `IsNull(EqualTo(outer, inner))`:
   ```sql
   select *
   from  t
   where null in (select t1a from t1 where (t1a=null or (t1a=null) is null))
   ```
   Then the filter is `true` for every row, so the subquery still returns both rows and the expression is evaluated as `null in (1, 2)`, which correctly returns `null`.
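
   To make the three cases above concrete, here is a minimal sketch in plain Scala (no Spark dependency) that models SQL three-valued logic with `Option`, where `None` stands for `NULL`. The helper names (`sqlEq`, `sqlOr`, `isNull`, `sqlIn`, `where`) are made up for this illustration and are not Catalyst APIs; the sketch only mirrors the semantics described in this comment.

   ```scala
   // Illustration only: None models SQL NULL. All names here are hypothetical.
   object NullInSubquery {
     type SqlBool = Option[Boolean]
     type SqlInt = Option[Int]

     // SQL `a = b`: NULL if either side is NULL.
     def sqlEq(a: SqlInt, b: SqlInt): SqlBool =
       for (x <- a; y <- b) yield x == y

     // SQL `a OR b` under three-valued logic.
     def sqlOr(a: SqlBool, b: SqlBool): SqlBool = (a, b) match {
       case (Some(true), _) | (_, Some(true)) => Some(true)
       case (Some(false), Some(false))        => Some(false)
       case _                                 => None
     }

     // SQL `x IS NULL`: always true or false, never NULL.
     def isNull[A](a: Option[A]): SqlBool = Some(a.isEmpty)

     // SQL `v IN (rows)`: false for an empty set, true on a match,
     // NULL if there is no match but some comparison was NULL.
     def sqlIn(v: SqlInt, rows: Seq[SqlInt]): SqlBool =
       rows.map(sqlEq(v, _)).foldLeft[SqlBool](Some(false))(sqlOr)

     // SQL WHERE: keeps a row only when the predicate is true (not NULL).
     def where(rows: Seq[SqlInt])(p: SqlInt => SqlBool): Seq[SqlInt] =
       rows.filter(r => p(r).contains(true))

     val t1a: Seq[SqlInt] = Seq(Some(1), Some(2)) // column t1a of table t1
     val outerValue: SqlInt = None                // the literal `null`

     // null in (select t1a from t1)
     val original: SqlBool = sqlIn(outerValue, t1a)

     // ... where (t1a = null or t1a is null)
     val withIsNullInner: SqlBool =
       sqlIn(outerValue, where(t1a)(r => sqlOr(sqlEq(outerValue, r), isNull(r))))

     // ... where (t1a = null or (t1a = null) is null)
     val withIsNullEq: SqlBool =
       sqlIn(outerValue, where(t1a)(r => sqlOr(sqlEq(outerValue, r), isNull(sqlEq(outerValue, r)))))

     def main(args: Array[String]): Unit = {
       println(original)        // None, i.e. NULL
       println(withIsNullInner) // Some(false): the rewrite changed the result
       println(withIsNullEq)    // None, i.e. NULL: same as the original
     }
   }
   ```

   In this model, only the `IsNull(EqualTo(outer, inner))` form preserves the result of the original query when the outer value is `null`.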
   
   
