bodduv commented on PR #14500:
URL: https://github.com/apache/iceberg/pull/14500#issuecomment-3541401314

   Thank you for the comment @pvary 
   > * We have a table with an UUID column
   > * We inserted 2 rows to the table with UUID_MIN and UUID_MAX with Java 
Iceberg 1.10.0, and calculated column stats (min=UUID_MAX, max=UUID_MIN)
   
   It matters how a query engine prepares the min and max values for UUID 
columns before handing them over for writing manifest files and manifest lists. 
Some engines could use the min and max values as prepared by Parquet Java (which 
is RFC compliant) during writes.
   
   > * We run a query which filter on UUID_MIDDLE.
   >   
   >   * I expect that the metadata filtering will return the new file 
(UUID_MAX < UUID_MIDDLE < UUID_MIN), and we will find the row
   > 
   > Am I correct, that after the upgrade the metadata filtering will skip the 
new file (UUID_MIDDLE < UUID_MAX) - filtered out by the wrong min value?
   
   Yes, if the min and max metrics persisted in the manifest file and manifest 
list are constructed using the faulty, non-RFC-compliant UUID comparisons, then 
we would not be able to read the new file back with such a filter (on the UUID 
column) after upgrading. What is even more problematic (evident in my testing) 
is that even an equality filter `uuid_col = ...` will leave out records that are 
supposed to be returned. Note that a full table scan will still read the new 
file.
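   
   To illustrate the mismatch, here is a minimal sketch in plain Java; it is 
not Iceberg's actual metrics evaluation code. The class `UuidBoundsDemo`, the 
helper `mightContain`, and the concrete UUID values are made up for 
illustration. It assumes the faulty ordering is the signed comparison performed 
by `java.util.UUID.compareTo`, and that the RFC-compliant ordering is an 
unsigned comparison of the 16 bytes.

```java
import java.util.Comparator;
import java.util.UUID;

class UuidBoundsDemo {
    // Assumed "old" ordering: java.util.UUID#compareTo compares both
    // 64-bit halves as signed longs.
    static final Comparator<UUID> SIGNED = UUID::compareTo;

    // Assumed "new" ordering: unsigned comparison of both halves, which is
    // equivalent to comparing the 16 big-endian bytes as unsigned values.
    static final Comparator<UUID> UNSIGNED = (a, b) -> {
        int cmp = Long.compareUnsigned(
            a.getMostSignificantBits(), b.getMostSignificantBits());
        return cmp != 0 ? cmp : Long.compareUnsigned(
            a.getLeastSignificantBits(), b.getLeastSignificantBits());
    };

    // Hypothetical stand-in for min/max pruning: a file is kept only if
    // lower <= value <= upper under the given ordering.
    static boolean mightContain(
            UUID lower, UUID upper, UUID value, Comparator<UUID> order) {
        return order.compare(lower, value) <= 0
            && order.compare(value, upper) <= 0;
    }

    public static void main(String[] args) {
        UUID uuidMin = new UUID(0L, 0L);                // 00000000-0000-0000-0000-000000000000
        UUID uuidMax = new UUID(-1L, -1L);              // ffffffff-ffff-ffff-ffff-ffffffffffff
        UUID uuidMiddle = new UUID(0L, Long.MIN_VALUE); // 00000000-0000-0000-8000-000000000000

        // Bounds as they would be written under the signed ordering:
        // lower = UUID_MAX, upper = UUID_MIN.
        UUID lower = SIGNED.compare(uuidMin, uuidMax) <= 0 ? uuidMin : uuidMax;
        UUID upper = SIGNED.compare(uuidMin, uuidMax) <= 0 ? uuidMax : uuidMin;

        // Reader using the same signed ordering keeps the file.
        System.out.println(mightContain(lower, upper, uuidMiddle, SIGNED));   // true
        // Reader using the unsigned ordering skips it (UUID_MIDDLE < lower).
        System.out.println(mightContain(lower, upper, uuidMiddle, UNSIGNED)); // false
        // Equality filter on a value that is actually in the file is also skipped.
        System.out.println(mightContain(lower, upper, uuidMin, UNSIGNED));    // false
    }
}
```

   In this sketch the stored bounds are inverted from the point of view of the 
unsigned ordering, so no value can satisfy `lower <= value <= upper` and the 
file is pruned for any predicate that relies on the bounds.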
   
   A remedy would be to migrate the table (doing a full table scan) and rewrite 
the metrics accurately.
   
   Note: This issue exists only in the Java implementation of the spec. The Go, 
Rust, and C++ implementations are RFC compliant, which makes the bug more 
severe: if the same table is read with a filter using the Go implementation, it 
produces correct records, but different ones than the Java implementation 
returns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

