bodduv commented on PR #14500: URL: https://github.com/apache/iceberg/pull/14500#issuecomment-3760098770
@pvary Thank you for promptly jumping in (during previous community sync) while I could not recollect the details of why we wanted to pause on the behavior change. It was discussed that evaluating expressions once with RFC-compliant comparator and again with signed comparator to then allow the filter to pass if either of the expression evaluations is true. I attempt this in [51f873e](https://github.com/apache/iceberg/pull/14500/commits/51f873e75a77a85176a96c2cfceb9acf15b83834). Following is a short description of the changes. - __Write Path__: while writing new Parquet files, metrics are prepared using RFC-compliant UUID comparison (Comparators.uuids()), producing correct min/max statistics. - __Read Path__: uses two expression evaluations: once with RFC-compliant comparator (to work with newly written files) and another with signed comparator for backward compatibility with older files that may have UUID statistics computed with Java's signed UUID.compareTo(). Affected evaluators: InclusiveMetricsEvaluator, ManifestEvaluator, ParquetMetricsRowGroupFilter - __Performance__: (almost) zero overhead for non-UUID queries. May require evaluating (filter) expressions twice, so there is some non-zero performance impact. But I suppose we would prioritize correctness. Implementation around expression evaluation was not conducive to such two fold evaluations. So I had to force it. I hope the increased code complexity is acceptable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
