bodduv opened a new issue, #14216:
URL: https://github.com/apache/iceberg/issues/14216

   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   I could not find documentation in the Iceberg specification on how UUID 
values should be compared. Both [RFC 
4122](https://datatracker.ietf.org/doc/html/rfc4122) and [RFC 
9562](https://www.rfc-editor.org/rfc/rfc9562.html) mention that UUID values are 
compared lexicographically on their string representation (given in [Augmented 
Backus–Naur 
form](https://en.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_form)) or, 
more commonly and efficiently, by unsigned byte-by-byte comparison in 
big-endian byte order. Note that RFC 4122 also includes reference C code that 
implements the latter.
   
   In contrast, Java's `java.util.UUID` represents a UUID as two signed (two's 
complement) `long` values and compares them as signed values. As a result, a 
UUID whose leading bit is set (a negative `long` value) sorts before a UUID 
whose leading bit is clear, which contradicts the RFCs. The relevant code 
causing this behavior is 
https://github.com/apache/iceberg/blob/441597e22ef3ec1ea03fd837cbc1e5dffce899a4/api/src/main/java/org/apache/iceberg/types/Comparators.java#L49
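
To make the difference concrete, here is a minimal sketch that compares two UUIDs with `java.util.UUID.compareTo` and with an unsigned comparison of the two 64-bit halves. The class name `UuidOrderDemo` and the helper `compareUnsigned` are hypothetical names used for illustration only; they are not part of Iceberg.

```java
import java.util.UUID;

public class UuidOrderDemo {
    // Hypothetical helper: orders UUIDs by treating both 64-bit halves as
    // unsigned, which matches unsigned big-endian byte-wise comparison.
    static int compareUnsigned(UUID a, UUID b) {
        int cmp = Long.compareUnsigned(a.getMostSignificantBits(), b.getMostSignificantBits());
        if (cmp != 0) {
            return cmp;
        }
        return Long.compareUnsigned(a.getLeastSignificantBits(), b.getLeastSignificantBits());
    }

    public static void main(String[] args) {
        UUID low = UUID.fromString("00000000-0000-0000-0000-000000000001");
        UUID high = UUID.fromString("80000000-0000-0000-0000-000000000000");

        // java.util.UUID compares the halves as signed longs, so the UUID
        // with the leading bit set is treated as the smaller one.
        System.out.println(Integer.signum(low.compareTo(high)));        // prints 1
        // Unsigned comparison puts the same pair in the opposite order.
        System.out.println(Integer.signum(compareUnsigned(low, high))); // prints -1
    }
}
```

The two orderings agree whenever the leading bits of the compared halves are equal, so the disagreement only shows up for pairs like the one above.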
   
   [Apache Parquet 
specification](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid)
 explicitly mentions that the sort order is via unsigned byte-wise comparison.
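
As a rough illustration of the unsigned byte-wise ordering, the sketch below serializes a UUID to 16 big-endian bytes and compares the byte arrays with `Arrays.compareUnsigned` (available since Java 9). The class and method names are hypothetical, and this is not code from Parquet or Iceberg.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.UUID;

public class UuidBytewiseOrder {
    // Serialize the UUID as 16 bytes, most significant byte first.
    // ByteBuffer uses big-endian byte order by default.
    static byte[] toBigEndianBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
            .putLong(uuid.getMostSignificantBits())
            .putLong(uuid.getLeastSignificantBits())
            .array();
    }

    // Compare byte by byte, treating each byte as an unsigned value.
    // Only the sign of the result is meaningful.
    static int compareBytewise(UUID a, UUID b) {
        return Arrays.compareUnsigned(toBigEndianBytes(a), toBigEndianBytes(b));
    }

    public static void main(String[] args) {
        UUID a = UUID.fromString("7fffffff-ffff-ffff-ffff-ffffffffffff");
        UUID b = UUID.fromString("80000000-0000-0000-0000-000000000000");
        System.out.println(compareBytewise(a, b) < 0); // prints true
        System.out.println(a.compareTo(b) < 0);        // prints false
    }
}
```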
   
   I found the following inconsistency across implementations:
   - apache/iceberg: based on `java.util.UUID`, which contains a [known 
bug](https://bugs.openjdk.org/browse/JDK-7025832)
   - apache/iceberg-go: based on [google/uuid](https://github.com/google/uuid), 
RFC 9562 compliant
   - apache/iceberg-python: (unsigned?) byte-wise comparison
   - apache/iceberg-rust: based on the [uuid 
crate](https://docs.rs/uuid/latest/uuid/), also RFC 9562 compliant
   
   Because of this difference, calling Iceberg APIs via `apache/iceberg` can 
produce different results from the other implementations. One problematic case, 
among others, is reading an Iceberg table with an equality filter on a UUID 
column. The filter triggers `InclusiveMetricsEvaluator` for ManifestFile 
filtering (also, how are the MIN and MAX bounds prepared?), so the Java-based 
`apache/iceberg` implementation may return different records than the other 
implementations.
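
A simplified sketch of how such a mismatch could arise is shown below. It is not the actual `InclusiveMetricsEvaluator` logic: the class `UuidBoundsSketch`, the method `mightContain`, and the bound values are all hypothetical, and the sketch assumes the lower and upper bounds were written by an engine that orders UUIDs as unsigned values.

```java
import java.util.Comparator;
import java.util.UUID;

public class UuidBoundsSketch {
    // Signed ordering, as implemented by java.util.UUID.compareTo.
    static final Comparator<UUID> SIGNED = Comparator.naturalOrder();

    // Unsigned ordering of the two 64-bit halves (RFC-style byte-wise order).
    static final Comparator<UUID> UNSIGNED = (a, b) -> {
        int cmp = Long.compareUnsigned(a.getMostSignificantBits(), b.getMostSignificantBits());
        return cmp != 0
            ? cmp
            : Long.compareUnsigned(a.getLeastSignificantBits(), b.getLeastSignificantBits());
    };

    // Hypothetical min/max check: a file may contain `value` only if
    // lower <= value <= upper under the given ordering.
    static boolean mightContain(UUID lower, UUID upper, UUID value, Comparator<UUID> order) {
        return order.compare(lower, value) <= 0 && order.compare(value, upper) <= 0;
    }

    public static void main(String[] args) {
        // Assumed bounds, computed by a writer that uses unsigned ordering.
        UUID lower = UUID.fromString("00000000-0000-0000-0000-000000000000");
        UUID upper = UUID.fromString("c0000000-0000-0000-0000-000000000000");
        UUID value = UUID.fromString("80000000-0000-0000-0000-000000000001");

        System.out.println(mightContain(lower, upper, value, UNSIGNED)); // prints true
        System.out.println(mightContain(lower, upper, value, SIGNED));   // prints false
    }
}
```

Under these assumptions, a reader using the signed ordering would skip a file that the unsigned ordering says may contain the value, which is the kind of divergence described above.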
   
   Question: Can we formalize in the Iceberg spec how UUID comparisons should 
happen?
   
   ### Willingness to contribute
   
   - [x] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

