khjoshi94 commented on issue #15427: URL: https://github.com/apache/iceberg/issues/15427#issuecomment-3954515986
@RussellSpitzer Thanks for your response. Let me restate the point more precisely, since my earlier example may have introduced some confusion.

That example was meant to illustrate cross-engine differences in behavior, not to suggest that Athena treats "foo" and "foo " as equal. All engines, including Spark and Athena, apply strict string equality during partition pruning, so "foo" does not match "foo " anywhere.

The underlying issue I'm trying to highlight is that Iceberg writes string partition values exactly as they appear in the incoming row. If a value contains trailing whitespace, that exact value is stored in the manifest. Because all engines use strict equality for partition pruning, a user filter like `batch_date = '20240201'` will not match a stored value like `'20240201 '`, even though the data logically belongs to that partition.

I also appreciate your earlier point about the Iceberg community's decision not to sanitize user input, and I agree with that perspective. My view is that normalizing string partition values at write time helps ensure that strict equality behaves consistently and avoids silent mis-pruning across all engines. From that angle, this change still provides value by preventing a subtle correctness issue that can be tricky to troubleshoot.

Thanks again for your prompt responses.
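To make the scenario concrete, here is a minimal Python sketch of the mismatch. It is not Iceberg code: the `manifest` list, the file names, and the `prune` and `normalize` helpers are hypothetical stand-ins, and the normalization shown (stripping trailing whitespace only) is just one assumed form that write-time normalization could take.

```python
# Hypothetical stand-in for manifest entries: (stored partition value, data file).
# The file names are made up for illustration.
manifest = [
    ("20240201", "file-a.parquet"),
    ("20240201 ", "file-b.parquet"),  # trailing space stored exactly as written
    ("20240202", "file-c.parquet"),
]


def prune(entries, wanted):
    """Keep only files whose stored partition value strictly equals `wanted`."""
    return [path for value, path in entries if value == wanted]


def normalize(value):
    """Assumed write-time normalization: strip trailing whitespace only."""
    return value.rstrip()


# Pruning against the values as they were written skips file-b.parquet.
as_written = prune(manifest, "20240201")

# Pruning against values normalized at write time keeps both files.
normalized = prune([(normalize(v), p) for v, p in manifest], "20240201")

print(as_written)   # ['file-a.parquet']
print(normalized)   # ['file-a.parquet', 'file-b.parquet']
```

With the values stored as written, the filter silently drops `file-b.parquet`. With the assumed normalization, the same strict equality check returns both files.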
