zhztheplayer opened a new pull request, #13437:
URL: https://github.com/apache/iceberg/pull/13437

   Currently, `SparkBatch` checks equality by comparing the `hashCode()` of 
its parent `SparkScan` instance.
   
   
https://github.com/apache/iceberg/blob/28b90ea1870643fcdb3afca5426656ab6caa8163/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L164-L165
   
   
https://github.com/apache/iceberg/blob/28b90ea1870643fcdb3afca5426656ab6caa8163/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java#L211-L223
   
   This is an anti-pattern: hash codes can collide, so two objects that are 
not equal can easily be treated as equal by the program.
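
   The hazard can be reproduced outside Iceberg. The following is a minimal 
sketch, not the actual `SparkBatch` / `SparkScan` code: `Scan` and 
`HashBasedBatch` are hypothetical stand-ins, and the strings `"Aa"` and `"BB"` 
are used only because they are a well-known pair with the same 
`String.hashCode()`.

```java
import java.util.Objects;

public class HashEqualityPitfall {
  // Hypothetical stand-in for a scan that is fully described by one string.
  static final class Scan {
    final String description;

    Scan(String description) {
      this.description = description;
    }

    @Override
    public int hashCode() {
      return Objects.hash(description);
    }
  }

  // Anti-pattern: equality is delegated to the hash code of the wrapped scan.
  static final class HashBasedBatch {
    final Scan scan;

    HashBasedBatch(Scan scan) {
      this.scan = scan;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof HashBasedBatch)) {
        return false;
      }
      // Only the 32-bit hash codes are compared, never the scan contents.
      return scan.hashCode() == ((HashBasedBatch) o).scan.hashCode();
    }

    @Override
    public int hashCode() {
      return scan.hashCode();
    }
  }

  public static void main(String[] args) {
    HashBasedBatch a = new HashBasedBatch(new Scan("Aa"));
    HashBasedBatch b = new HashBasedBatch(new Scan("BB"));
    // The two scans differ, yet the batches compare as equal.
    System.out.println(a.equals(b)); // prints "true"
  }
}
```

   The `equals()` / `hashCode()` contract only requires equal objects to have 
equal hash codes; the converse does not hold, which is why a hash match alone 
cannot prove equality.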
   
   More specifically, Spark relies on `SparkBatch` equality to determine 
whether two `BatchScanExec` nodes are equal.
   
   
https://github.com/apache/spark/blob/fdd7f6fb491e4e52898d884db887cd0a8a707ec3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L46-L56
   
   Spark's exchange and subquery substitution (reuse) processes can rely on 
the equality of `BatchScanExec`. An incorrect `equals()` implementation could 
therefore cause unexpected behavior in Spark's query planner.
   
   This patch adds a checksum-based comparison, implemented in a newly added 
utility `DigestUtil`, to make the equality check sound.
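
   As a rough illustration of what checksum-based equality can look like, 
here is a sketch under stated assumptions. It is not the `DigestUtil` from 
this patch, whose API and algorithm may differ: the class 
`ChecksumEqualitySketch`, the helper `digest`, the class `DigestBasedBatch`, 
and the choice of SHA-256 over a list of strings are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

public class ChecksumEqualitySketch {
  // Hypothetical helper (not the patch's DigestUtil): SHA-256 over the fields.
  static byte[] digest(List<String> fields) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      for (String field : fields) {
        byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
        // A length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        md.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
        md.update(bytes);
      }
      return md.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 is not available", e);
    }
  }

  // Hypothetical batch whose equality compares 256-bit checksums.
  static final class DigestBasedBatch {
    final byte[] checksum;

    DigestBasedBatch(List<String> fields) {
      this.checksum = digest(fields);
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof DigestBasedBatch)) {
        return false;
      }
      return Arrays.equals(checksum, ((DigestBasedBatch) o).checksum);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(checksum);
    }
  }

  public static void main(String[] args) {
    DigestBasedBatch a = new DigestBasedBatch(List.of("Aa"));
    DigestBasedBatch b = new DigestBasedBatch(List.of("BB"));
    DigestBasedBatch c = new DigestBasedBatch(List.of("Aa"));
    System.out.println(a.equals(b));
    System.out.println(a.equals(c));
  }
}
```

   In this sketch the inputs `"Aa"` and `"BB"`, which share a 32-bit 
`hashCode()`, produce different digests, while identical inputs produce 
identical digests. A digest comparison is still not a field-by-field 
comparison, but an accidental collision in a 256-bit digest is far less likely 
than one in a 32-bit hash code.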


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

