Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/17491#discussion_r109272800
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
---
@@ -90,11 +90,12 @@ trait PredicateHelper {
* Returns true iff `expr` could be evaluated as a condition within join.
*/
protected def canEvaluateWithinJoin(expr: Expression): Boolean = expr
match {
- case l: ListQuery =>
+ case _: ListQuery | _: Exists =>
// A ListQuery defines the query which we want to search in an IN
subquery expression.
// Currently the only way to evaluate an IN subquery is to convert
it to a
// LeftSemi/LeftAnti/ExistenceJoin by `RewritePredicateSubquery`
rule.
// It cannot be evaluated as part of a Join operator.
+ // An Exists shouldn't be push into a Join operator too.
--- End diff --
> I am not sure. The name of this is def canEvaluateWithinJoin so I assume
it asks whether an input Expression can be processed as part of a Join
operator. Can a ScalarSubquery be processed inside a Join?
I remember `ScalarSubquery` without correlated reference will be evaluated
as individual query plan and get its result back as an expression. So it should
not no difference in run time compared with other expressions.
A `Limit` looks good to me for now. I can't think a negative side effect
prevents possible optimization for the subquery plan. Doesn't it just like a
re-written query with a limit clause added?
I think this is a corner usage case. To address this in run-time like the
introduction of "early out" into physical join operators works, but it may
involve a lot of code changes.
> 2) one more step further: cache the result of the RHS without a rescan
because the next row from the parent table will always get the same answer from
rescanning the subquery.
I quickly scan physical `SortMergeJoin` operator. If the streamed row
matches the scanned group of rows, it will reuse the scanned group. Sounds it
does what you said, if I don't miss something.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]