[GitHub] [spark] HyukjinKwon opened a new pull request #28695: [SPARK-28344][SQL] Check the ambiguous self-join only if there is a join in the plan

GitBox Mon, 01 Jun 2020 07:10:04 -0700


HyukjinKwon opened a new pull request #28695:
URL: https://github.com/apache/spark/pull/28695



   ### What changes were proposed in this pull request?
   
   This PR proposes to run the `DetectAmbiguousSelfJoin` check only if there is a `Join` in the plan. Currently, the check is too strict and rejects even queries that contain no join.
   
   For example, the code below has no join at all, but it fails with an ambiguous self-join error:
   
   ```scala
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.sum
   val df = Seq(1, 1, 2, 2).toDF("A")
   val w = Window.partitionBy(df("A"))
   df.select(df("A").alias("X"), sum(df("A")).over(w)).explain(true)
   ```
   
   This is because `ExtractWindowExpressions` can create an `AttributeReference` with the same metadata but a different expression ID, see:
   
   
https://github.com/apache/spark/blob/0fd98abd859049dc3b200492487041eeeaa8f737/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2679
   
https://github.com/apache/spark/blob/71c73d58f6e88d2558ed2e696897767d93bac60f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L63
   
https://github.com/apache/spark/blob/5945d46c11a86fd85f9e65f24c2e88f368eee01f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L180
   
   Before the rule is applied:
   
   ```
   'Project [A#19 AS X#21, sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
   +- Relation[A#19] parquet
   ```
   
   After the rule is applied:
   
   ```
   Project [X#21, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
   +- Project [X#21, A#19, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
      +- Window [sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L], [A#19]
         +- Project [A#19 AS X#21, A#19]
            +- Relation[A#19] parquet
   ```
   
   `X#21` holds the same DataFrame ID and column position metadata as `A#19`, but it has a different expression ID, which causes the check to fail.
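   
   The mismatch and the proposed guard can be illustrated with a toy model in plain Scala. This is only a sketch: `Attr`, `Plan`, `Relation`, `Project`, `Join`, `looksAmbiguous`, and `guardedCheck` are hypothetical stand-ins written for this example, not Spark's Catalyst classes or the actual `DetectAmbiguousSelfJoin` implementation, and the ambiguity test is deliberately simplified to "same DataFrame ID and column position, different expression ID".
   
   ```scala
   // Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
   // An attribute carries an expression ID plus DataFrame ID / column position metadata.
   final case class Attr(name: String, exprId: Int, dfId: Long, colPos: Int)

   sealed trait Plan {
     def output: Seq[Attr]
     def children: Seq[Plan]
   }
   final case class Relation(output: Seq[Attr]) extends Plan {
     def children: Seq[Plan] = Nil
   }
   final case class Project(output: Seq[Attr], child: Plan) extends Plan {
     def children: Seq[Plan] = Seq(child)
   }
   final case class Join(left: Plan, right: Plan) extends Plan {
     def output: Seq[Attr] = left.output ++ right.output
     def children: Seq[Plan] = Seq(left, right)
   }

   // True if any node in the plan is a Join.
   def containsJoin(plan: Plan): Boolean = plan match {
     case _: Join => true
     case other   => other.children.exists(containsJoin)
   }

   // All attributes produced by any node in the plan.
   def collectAttrs(plan: Plan): Seq[Attr] =
     plan.output ++ plan.children.flatMap(collectAttrs)

   // Simplified check (an assumption of this sketch): a reference is flagged when
   // the plan holds an attribute with the same metadata but another expression ID.
   def looksAmbiguous(refs: Seq[Attr], plan: Plan): Boolean = {
     val attrs = collectAttrs(plan)
     refs.exists { r =>
       attrs.exists(a => a.dfId == r.dfId && a.colPos == r.colPos && a.exprId != r.exprId)
     }
   }

   // The idea proposed here: only run the check when the plan has a Join.
   def guardedCheck(refs: Seq[Attr], plan: Plan): Boolean =
     containsJoin(plan) && looksAmbiguous(refs, plan)

   // Mirrors the plan above: X#21 keeps A#19's metadata but has a new expression ID.
   val a = Attr("A", exprId = 19, dfId = 1L, colPos = 0)
   val x = a.copy(name = "X", exprId = 21)
   val windowLikePlan = Project(Seq(x, a), Relation(Seq(a)))
   val joinedPlan = Join(windowLikePlan, Relation(Seq(a)))
   ```
   
   In this model, `looksAmbiguous(Seq(x), windowLikePlan)` flags the join-free plan, while `guardedCheck` skips it and still flags `joinedPlan`.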
   
   ### Why are the changes needed?
   
   To loosen the check so that users are not surprised by failures on queries without a join.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The change affects unreleased branches only.
   
   ### How was this patch tested?
   
   Manually tested, and a unit test was added.
   

