khjoshi94 commented on issue #15427:
URL: https://github.com/apache/iceberg/issues/15427#issuecomment-3949421592

   Thank you very much @RussellSpitzer for sharing your thoughts so quickly.
   
   Regarding your comment:
   > Could you also explain how there is different behavior on the read side in different engines? I'm not sure I followed from your example
   
   Sure, and apologies if my example was not clear enough.
   As I understand it, Iceberg stores partition values exactly as they appear 
in the incoming row. If a value contains trailing whitespace (e.g., 
`"20240201 "`), that exact value is written into the manifest.
   
   Spark uses strict string equality during partition pruning, so `"20240201"` 
does not match `"20240201 "` and the file is pruned. This results in empty 
reads or empty joins.
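
   To make the mismatch concrete, here is a minimal Python sketch of exact-match 
pruning. It is not Spark or Iceberg code: `prune_by_equality` and the sample 
values are hypothetical and only illustrate the comparison described above.

```python
# Hypothetical stand-in for exact-match partition pruning (not Spark or Iceberg code).
def prune_by_equality(partition_values, predicate_value):
    """Return the partition values kept by a strict string-equality filter."""
    return [v for v in partition_values if v == predicate_value]


# Partition values as they might be recorded in a manifest;
# the second one carries trailing whitespace.
stored = ["20240131", "20240201 "]

kept = prune_by_equality(stored, "20240201")
print(kept)  # [] because "20240201" != "20240201 ", so that partition is pruned
```

   Filtering on the padded value `"20240201 "` would keep the partition, which 
is why the query looks correct only when the predicate happens to carry the same 
whitespace.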
   
   From what I have found, Athena's partition filtering behavior is not 
identical to Spark's because it relies on AWS Glue's metadata and evaluation 
rules, which do not always apply strict byte-for-byte string equality. This 
could lead to Athena matching values that Spark would prune.
   
   Based on these observations, this may lead to inconsistent behavior across 
engines, i.e. the same Iceberg table returns different results depending on 
where it is queried.
   
   Normalizing string partition values at write time would ensure that all 
engines see the same canonical value, so their pruning decisions agree.
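
   As a rough sketch of what write-time normalization could look like, assuming 
it only strips trailing whitespace from string values: the helper name 
`normalize_partition_value`, the column name `dt`, and the sample rows are 
hypothetical, and this is plain Python rather than an Iceberg API.

```python
def normalize_partition_value(value):
    """Hypothetical write-time rule: strip trailing whitespace from strings only."""
    if isinstance(value, str):
        return value.rstrip()
    return value  # non-string partition values pass through unchanged


# Incoming rows where the same logical date arrives with and without padding.
incoming_rows = [
    {"dt": "20240201 ", "amount": 10},
    {"dt": "20240201", "amount": 5},
]

# Normalize the partition column before the rows are written.
normalized = [
    {**row, "dt": normalize_partition_value(row["dt"])} for row in incoming_rows
]

# Both rows now share one canonical partition value.
distinct = {row["dt"] for row in normalized}
print(distinct)  # {'20240201'}
```

   With a single canonical value in the manifest, an exact-match filter on 
`"20240201"` finds both rows regardless of how strictly an engine compares 
strings.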
   
   Hope I was able to clarify based on my limited understanding. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

