[GitHub] [spark] Ngone51 opened a new pull request #32692: [SPARK-35454][SQL][3.1] One LogicalPlan can match multiple dataset ids

GitBox Fri, 28 May 2021 00:28:32 -0700


Ngone51 opened a new pull request #32692:
URL: https://github.com/apache/spark/pull/32692



   ### What changes were proposed in this pull request?
   
   Change the type of `DATASET_ID_TAG` from `Long` to `HashSet[Long]` so that a 
single logical plan can match multiple Datasets.
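   
   As a rough illustration of the idea, the sketch below models a plan node 
whose tag holds a set of ids rather than a single `Long`. `PlanNode`, 
`TagSketch`, `tag` and `matches` are hypothetical names chosen for this 
example; they are not Spark APIs and do not reflect the actual patch.
   
   ```scala
   import scala.collection.mutable

   // Hypothetical stand-in for a logical plan node; not Spark's LogicalPlan.
   final class PlanNode(val name: String) {
     // Models DATASET_ID_TAG after this change: a set of ids, not one Long.
     val datasetIds: mutable.HashSet[Long] = mutable.HashSet.empty[Long]
   }

   object TagSketch {
     // Record that a Dataset with the given id is backed by this plan.
     def tag(plan: PlanNode, datasetId: Long): Unit =
       plan.datasetIds += datasetId

     // A column reference matches the plan if its dataset id was recorded.
     def matches(plan: PlanNode, columnDatasetId: Long): Boolean =
       plan.datasetIds.contains(columnDatasetId)
   }
   ```
   
   With a set-valued tag, every Dataset that shares the plan can add its own 
id, so a column taken from any of them can still be matched to the plan.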
   
   ### Why are the changes needed?
   
   When one Dataset is transformed into another, the `DATASET_ID_TAG` on the 
logical plan is not updated if the plan itself does not change:
   
   
https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L234-L237
   
   However, the Dataset id always changes, even if the logical plan does not:
   
https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L207-L208
   
   This can lead to a mismatch between the Dataset's id and the column's 
`__dataset_id`. For example:
   
   ```scala
     test("SPARK-28344: fail ambiguous self join - Dataset.colRegex as column ref") {
       // The test can fail if we change it to:
       // val df1 = spark.range(3).toDF()
       // val df2 = df1.filter($"id" > 0).toDF()
       val df1 = spark.range(3)
       val df2 = df1.filter($"id" > 0)
   
       withSQLConf(
         SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
         SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
         assertAmbiguousSelfJoin(df1.join(df2, df1.colRegex("id") > df2.colRegex("id")))
       }
     }
   ```
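   
   The mismatch can be reproduced in a small model that does not depend on 
Spark. This is a simplified sketch under stated assumptions: `Plan`, `Ds`, 
`IdGen` and `samePlan` are hypothetical names, and the single-valued tag that 
is set only once is an assumption about the behavior before this change.
   
   ```scala
   import java.util.concurrent.atomic.AtomicLong

   // Hypothetical stand-in for a logical plan carrying one optional id tag.
   final class Plan { var tag: Option[Long] = None }

   // Hypothetical global id counter, so every dataset gets a fresh id.
   object IdGen { val next = new AtomicLong(0L) }

   final class Ds(val plan: Plan) {
     // The dataset id always changes, even when the plan is reused.
     val id: Long = IdGen.next.getAndIncrement()

     // Assumed old behavior: the tag is set once and never updated.
     if (plan.tag.isEmpty) plan.tag = Some(id)

     // Models a transformation that leaves the plan unchanged.
     def samePlan(): Ds = new Ds(plan)

     // A column reference carries the id of the dataset it came from.
     def colDatasetId: Long = id
   }
   ```
   
   In this model, `val a = new Ds(new Plan); val b = a.samePlan()` leaves the 
plan tagged with `a.id` only, so a column taken from `b` carries an id that 
the plan's tag does not contain.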
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Added unit tests.


