Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/17330#discussion_r107077129

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -61,6 +63,37 @@ abstract class SubqueryExpression(
   }
 }

+/**
+ * This expression is used to represent any form of subquery expression, namely
+ * ListQuery, Exists and ScalarSubquery. It is only used to make sure that
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). It is a transient expression
+ * that only lives in the scope of a sameResult call. In other words, the
+ * analyzer, optimizer and planner never see this expression type during
+ * transformation of plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends UnaryExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def child: Expression = expr
+  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // The hash code is generated conservatively for now, i.e. it does not include
+  // the subquery plan. Including it causes issues when we canonicalize
+  // expressions to re-order them based on hash code.
+  // TODO: improve the hash code generation by considering the plan info.
+  override def hashCode(): Int = {
+    val h = Objects.hashCode(expr.children)
--- End diff --

@cloud-fan Hi Wenchen, let me take an example; perhaps that will help us. Let's assume there is a Filter operator that has three subquery expressions, like the following:

```scala
Filter Scalar-subquery(.., splan1, ..) || Exists(.., splan2, ..) || In(.., ListQuery(.., splan3, ..))
```

1. 
During sameResult on this plan, we perform this [logic](https://github.com/dilipbiswal/spark/blob/0c6424574282db0e865ef4ad14c5a9179fcf27ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L367-L375).
2. We clean the args from both the source and target plans.
2.1 We change each subquery expression to a CanonicalizedSubqueryExpr so that we can do a customized equals that basically does a semantic equals of the two subquery expressions.
2.2 Additionally, we rewrite the subquery plans (splan1, splan2, splan3) to change the outer attribute references to BindReference. Since these are outer references, the inner plan is not aware of these attributes, and hence we fix them up as part of canonicalization. After this, when the equals method is called on CanonicalizedSubqueryExpr [here](https://github.com/dilipbiswal/spark/blob/0c6424574282db0e865ef4ad14c5a9179fcf27ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L372) and sameResult is called on splan1, splan2 and splan3, it works fine, as the outer references have been normalized to BindReference. The local attributes referenced in the plans are handled by sameResult itself.
2.3 As part of cleanArgs, after cleaning the expressions we call canonicalized [here](https://github.com/dilipbiswal/spark/blob/0c6424574282db0e865ef4ad14c5a9179fcf27ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L402). Here we try to re-order the three CanonicalizedSubqueryExprs on the basis of their hashCode. When I had hashCode call expr.semanticHash(), I observed that the ordering of the expressions between the source and target was not consistent. I debugged this and found that expr.semanticHash() considers the subquery plans, since they are class arguments, and the plans contain arbitrary expression IDs (at this point we have only normalized the outer references, not all of them).
Thus, in the current implementation I am excluding the subplans while computing the hash code, and have marked it as a TODO. Please let me know your thoughts, and whether I have misunderstood anything.
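To make the equals/hashCode concern concrete, here is a minimal sketch with toy stand-in types (ToyPlan, ToySubquery and the simplified wrapper below are hypothetical, not the actual Spark classes). If equals compares subqueries semantically, ignoring arbitrary expression IDs, but hashCode folded those IDs in, equal objects could hash differently; hashing only the stable parts keeps the contract intact:

```scala
// Toy "plan" carrying an arbitrary, per-instance expression ID (hypothetical
// stand-in for exprId in a real plan).
case class ToyPlan(output: String, exprId: Long)

// Toy "subquery expression": children are the outer references, plan is the subplan.
case class ToySubquery(children: Seq[String], plan: ToyPlan)

// Simplified wrapper in the spirit of CanonicalizedSubqueryExpr (hypothetical).
case class CanonicalizedSub(expr: ToySubquery) {
  // Semantic equality: compare plans by their output, not by exprId.
  override def equals(other: Any): Boolean = other match {
    case that: CanonicalizedSub =>
      expr.children == that.expr.children && expr.plan.output == that.expr.plan.output
    case _ => false
  }
  // Conservative hash over the children only; including the plan (with its
  // arbitrary exprIds) would give semantically equal objects different hashes.
  override def hashCode(): Int = expr.children.hashCode()
}

val a = CanonicalizedSub(ToySubquery(Seq("outer#1"), ToyPlan("x", exprId = 10)))
val b = CanonicalizedSub(ToySubquery(Seq("outer#1"), ToyPlan("x", exprId = 99)))
assert(a == b)                   // semantically equal despite different exprIds
assert(a.hashCode == b.hashCode) // contract holds because the plan is excluded
println("contract ok")
```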
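The re-ordering symptom from 2.3 can be sketched the same way (again with hypothetical toy types): sorting by a hash that folds in arbitrary per-plan IDs can put "the same" expressions in different orders on the source and target side, while a hash over stable parts always sorts both sides identically:

```scala
// Toy subquery tagged with an arbitrary expression ID (hypothetical stand-in).
case class Sub(name: String, exprId: Long)

// A hash that wrongly folds in the arbitrary exprId, roughly what happens when
// the subquery plan (with un-normalized IDs) participates in the hash.
def unstableHash(s: Sub): Int = (s.name, s.exprId).hashCode
// A hash over the stable parts only, mirroring the conservative hashCode above.
def stableHash(s: Sub): Int = s.name.hashCode

// The "same" three subqueries on both sides, but with different exprIds.
val source = Seq(Sub("scalar", 1), Sub("exists", 2), Sub("in", 3))
val target = Seq(Sub("scalar", 7), Sub("exists", 8), Sub("in", 9))

// Sorting by unstableHash may order the two sides differently; sorting by
// stableHash is guaranteed to agree, so pairwise comparison stays consistent.
val src = source.sortBy(stableHash).map(_.name)
val tgt = target.sortBy(stableHash).map(_.name)
assert(src == tgt)
println(src.mkString(","))
```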