liangz1 opened a new pull request #27565: [SPARK-30791] Dataframe add 
sameSemantics and semanticHash method
URL: https://github.com/apache/spark/pull/27565
 
 
   ### What changes were proposed in this pull request?
   This PR adds two `DeveloperApi` methods, `sameSemantics` and `semanticHash`,
   to the `Dataset[T]` class. Both methods simply expose existing lower-level
   methods through `Dataset[T]`.
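   
   The wrapping can be sketched as standalone helpers. This is a minimal
   sketch, assuming the wrappers forward to the analyzed logical plan's
   `sameResult` and `semanticHash`; the object name `SemanticsSketch` is
   hypothetical, and the PR itself adds the methods directly on `Dataset[T]`.

   ```scala
   import org.apache.spark.sql.Dataset

   // Hypothetical helpers for illustration only; not the code in this PR.
   object SemanticsSketch {
     // Assumption: two Datasets are compared through their analyzed logical plans.
     def sameSemantics[T](a: Dataset[T], b: Dataset[T]): Boolean =
       a.queryExecution.analyzed.sameResult(b.queryExecution.analyzed)

     // Assumption: the hash is taken from the same analyzed logical plan.
     def semanticHash[T](ds: Dataset[T]): Int =
       ds.queryExecution.analyzed.semanticHash()
   }
   ```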
   
   
   ### Why are the changes needed?
   They are useful when implementing DataFrame caching in Python: 
   `sameSemantics` checks whether two DataFrames are the same, and 
   `semanticHash` provides an ID for a DataFrame. Wrapping the lower-level 
   APIs makes them easier to use.
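   
   One way such a cache could use the two methods is sketched here, in Scala
   for consistency with the rest of this description. The class
   `SemanticCacheRegistry` and its method `cached` are hypothetical names, not
   part of this PR. The hash is used only to pick a bucket; `sameSemantics`
   confirms the match, since two different DataFrames may share a hash.

   ```scala
   import org.apache.spark.sql.DataFrame
   import scala.collection.mutable

   // Hypothetical cache registry for illustration; not part of this PR.
   final class SemanticCacheRegistry {
     // Buckets keyed by semanticHash; each holds the cached DataFrames with that hash.
     private val buckets = mutable.Map.empty[Int, List[DataFrame]]

     /** Returns an equivalent DataFrame that is already cached, or caches `df`. */
     def cached(df: DataFrame): DataFrame = synchronized {
       val key = df.semanticHash()
       val bucket = buckets.getOrElse(key, Nil)
       // Equal hashes only suggest equivalence; sameSemantics confirms it.
       bucket.find(_.sameSemantics(df)).getOrElse {
         buckets.update(key, df.cache() :: bucket)
         df
       }
     }
   }
   ```

   Note that `sameSemantics` can return true for DataFrames whose column names
   differ (see `df3` and `df4` in the session in this description), so a
   caller of such a registry may need to re-apply its own column names.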
   
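   ### Does this PR introduce any user-facing change?
   Yes. It adds the `semanticHash` and `sameSemantics` methods to 
   `Dataset[T]`, as this spark-shell session shows: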
   ```
   scala> val df1 = Seq((1,2),(4,5)).toDF("col1", "col2")
   df1: org.apache.spark.sql.DataFrame = [col1: int, col2: int]
   
   scala> val df2 = Seq((1,2),(4,5)).toDF("col1", "col2")
   df2: org.apache.spark.sql.DataFrame = [col1: int, col2: int]
   
   scala> val df3 = Seq((0,2),(4,5)).toDF("col1", "col2")
   df3: org.apache.spark.sql.DataFrame = [col1: int, col2: int]
   
   scala> val df4 = Seq((0,2),(4,5)).toDF("col0", "col2")
   df4: org.apache.spark.sql.DataFrame = [col0: int, col2: int]
   
   scala> df1.semanticHash
   res0: Int = 594427822
   
   scala> df2.semanticHash
   res1: Int = 594427822
   
   scala> df1.sameSemantics(df2)
   res2: Boolean = true
   
   scala> df1.sameSemantics(df3)
   res3: Boolean = false
   
   scala> df3.semanticHash
   res4: Int = -1592702048
   
   scala> df4.semanticHash
   res5: Int = -1592702048
   
   scala> df4.sameSemantics(df3)
   res6: Boolean = true
   ```
   
   
   ### How was this patch tested?
   The underlying lower-level API `sameResult` is already tested in 
   `org.apache.spark.sql.catalyst.plans.SameResultSuite`. `semanticHash` simply 
   delegates to `hashCode`, so a dedicated test may not be necessary.
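   
   If a direct check of the new methods were wanted, it could look roughly
   like the sketch below. It reuses `df1`, `df2` and `df3` from the session
   above and asserts only the relationships shown there; the object name
   `SemanticsCheck` and the local `SparkSession` setup are assumptions, not
   code from this PR.

   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical standalone check; not a test added by this PR.
   object SemanticsCheck {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .appName("semantics-check")
         .getOrCreate()
       import spark.implicits._
       try {
         val df1 = Seq((1, 2), (4, 5)).toDF("col1", "col2")
         val df2 = Seq((1, 2), (4, 5)).toDF("col1", "col2")
         val df3 = Seq((0, 2), (4, 5)).toDF("col1", "col2")

         // Identical data and schema: same hash and same semantics.
         assert(df1.semanticHash() == df2.semanticHash())
         assert(df1.sameSemantics(df2))
         // Different data: not the same semantics.
         assert(!df1.sameSemantics(df3))
       } finally {
         spark.stop()
       }
     }
   }
   ```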
   
