zhztheplayer opened a new pull request, #13437: URL: https://github.com/apache/iceberg/pull/13437
`SparkBatch` currently relies on the `hashCode()` of its parent `SparkScan` instance to check equality.

https://github.com/apache/iceberg/blob/28b90ea1870643fcdb3afca5426656ab6caa8163/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L164-L165
https://github.com/apache/iceberg/blob/28b90ea1870643fcdb3afca5426656ab6caa8163/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java#L211-L223

This is an anti-pattern: distinct objects can share a hash code, so objects that are not equal can easily be treated as equal by the program.

More specifically, Spark relies on `SparkBatch`'s equality to determine whether two `BatchScanExec` instances are equal.

https://github.com/apache/spark/blob/fdd7f6fb491e4e52898d884db887cd0a8a707ec3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala#L46-L56

The exchange and subquery substitution processes in Spark can rely on the equality of `BatchScanExec`, so an incorrect `equals()` implementation could result in unexpected behavior in Spark's query planner.

This patch adds a checksum-based algorithm, through a newly added utility `DigestUtil`, to make the equality comparison sound.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
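To illustrate the anti-pattern described above, here is a minimal, self-contained sketch. It is not the Iceberg code and not the PR's `DigestUtil`: the classes `Scan`, `HashBasedBatch`, and `DigestBasedBatch`, the single `filter` field, and the `sha256` helper are all hypothetical names chosen for illustration. It relies only on the well-known fact that the Java strings `"Aa"` and `"BB"` have the same `hashCode()`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class HashEqualitySketch {

  // Hypothetical stand-in for a scan; the single `filter` field is an assumption.
  static final class Scan {
    final String filter;

    Scan(String filter) {
      this.filter = filter;
    }

    @Override
    public int hashCode() {
      return filter.hashCode();
    }
  }

  // Anti-pattern: equality is decided only by the parent scan's hash code.
  static final class HashBasedBatch {
    final Scan scan;

    HashBasedBatch(Scan scan) {
      this.scan = scan;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof HashBasedBatch)) {
        return false;
      }
      return scan.hashCode() == ((HashBasedBatch) o).scan.hashCode();
    }

    @Override
    public int hashCode() {
      return scan.hashCode();
    }
  }

  // Illustrative alternative: compare a SHA-256 digest of the scan's content.
  static final class DigestBasedBatch {
    final byte[] digest;

    DigestBasedBatch(Scan scan) {
      this.digest = sha256(scan.filter);
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof DigestBasedBatch)) {
        return false;
      }
      return Arrays.equals(digest, ((DigestBasedBatch) o).digest);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(digest);
    }
  }

  // Hypothetical helper: SHA-256 over the UTF-8 bytes of a string.
  static byte[] sha256(String s) {
    try {
      return MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Scan a = new Scan("Aa");
    Scan b = new Scan("BB"); // different content, same String hash code as "Aa"
    System.out.println(new HashBasedBatch(a).equals(new HashBasedBatch(b)));     // prints "true"
    System.out.println(new DigestBasedBatch(a).equals(new DigestBasedBatch(b))); // prints "false"
  }
}
```

The hash-based comparison reports two different scans as equal because a 32-bit hash code cannot distinguish all inputs. A cryptographic digest makes such a collision practically negligible, though comparing the underlying fields directly would be the strictest option.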
