bodduv opened a new issue, #14216: URL: https://github.com/apache/iceberg/issues/14216
### Apache Iceberg version

main (development)

### Query engine

None

### Please describe the bug 🐞

I could not find documentation in the Iceberg specification on how UUID values should be compared.

Both [RFC 4122](https://datatracker.ietf.org/doc/html/rfc4122) and [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562.html) mention that UUID values are compared by lexicographically comparing their ([Augmented Backus–Naur form](https://en.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_form)) string representation or, more commonly and more efficiently, by unsigned byte-by-byte comparison (in big-endian byte order). Note that RFC 4122 also includes reference C code implementing the latter.

In contrast, Java's UUID type represents a UUID value as two signed (two's complement) long values and compares those signed long values. In other words, a UUID whose leading bit is set (which corresponds to a negative long value) compares as smaller, which departs from the RFCs entirely. The relevant code causing this behavior is https://github.com/apache/iceberg/blob/441597e22ef3ec1ea03fd837cbc1e5dffce899a4/api/src/main/java/org/apache/iceberg/types/Comparators.java#L49

The [Apache Parquet specification](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid) explicitly mentions that the sort order is unsigned byte-wise comparison.

I found the following contradiction:

- apache/iceberg: based on `java.util.UUID`, which contains a [known bug](https://bugs.openjdk.org/browse/JDK-7025832)
- apache/iceberg-go: based on [google/uuid](https://github.com/google/uuid), RFC 9562 compliant
- apache/iceberg-python: (unsigned?) byte-wise comparison
- apache/iceberg-rust: based on the [uuid crate](https://docs.rs/uuid/latest/uuid/), also RFC 9562 compliant

Because of this difference, calling Iceberg APIs via `apache/iceberg` will produce different results from those of the other implementations.
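The difference between the two orderings can be sketched in a few lines of Java. This is only an illustration, not Iceberg's code: the class `UuidOrder` and its helper methods are hypothetical names introduced here. It contrasts `java.util.UUID.compareTo`, which compares the two halves as signed longs, with an unsigned comparison of the 16-byte big-endian form.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.UUID;

// Hypothetical helper class, for illustration only.
final class UuidOrder {

    // Serialize a UUID to 16 bytes, most significant bits first.
    // ByteBuffer writes longs in big-endian order by default.
    static byte[] toBytes(UUID u) {
        return ByteBuffer.allocate(16)
                .putLong(u.getMostSignificantBits())
                .putLong(u.getLeastSignificantBits())
                .array();
    }

    // Unsigned byte-wise comparison of the big-endian form (Java 9+).
    static int compareUnsignedBytes(UUID a, UUID b) {
        return Arrays.compareUnsigned(toBytes(a), toBytes(b));
    }

    // Same ordering without allocating: compare each half as an unsigned long.
    static int compareUnsignedLongs(UUID a, UUID b) {
        int c = Long.compareUnsigned(a.getMostSignificantBits(), b.getMostSignificantBits());
        return c != 0
                ? c
                : Long.compareUnsigned(a.getLeastSignificantBits(), b.getLeastSignificantBits());
    }

    public static void main(String[] args) {
        UUID low = UUID.fromString("00000000-0000-0000-0000-000000000001");
        UUID high = UUID.fromString("80000000-0000-0000-0000-000000000000"); // leading bit set

        // Signed comparison of the two halves, as done by java.util.UUID.
        System.out.println("UUID.compareTo:  " + Integer.signum(low.compareTo(high)));
        // Unsigned comparisons, as described by the RFCs and the Parquet spec.
        System.out.println("unsigned bytes:  " + Integer.signum(compareUnsignedBytes(low, high)));
        System.out.println("unsigned longs:  " + Integer.signum(compareUnsignedLongs(low, high)));
    }
}
```

For these two values, `compareTo` orders `high` before `low` because the most significant half of `high` is a negative long, while both unsigned variants order `low` first.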
One such problematic case, among others, is reading an Iceberg table with an equality filter on a UUID column, which triggers `InclusiveMetricsEvaluator` for ManifestFile filtering (also, how are MIN and MAX prepared?). Here the Java-based `apache/iceberg` implementation may return different records than the other implementations.

Question: Can we formalize in the Iceberg spec how UUID comparisons should happen?

### Willingness to contribute

- [x] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
