Hello everyone,

Following our previous community sync discussions, I've put together a proposal document [1] exploring approaches to handle the UUID comparison bug [2]. I apologize for the delay. The document provides the full details; the two proposals are summarized below.
Proposal A: Dual-Comparator Evaluation + Block UUID in StrictMetricsEvaluator
• Inclusive evaluators try both the unsigned (RFC) and signed comparators and OR the results
• *[Not Yet Part of PR]* Disable StrictMetricsEvaluator for UUID predicates or schemas
• Pro: preserves read-path pruning for old files with legacy metrics; no metadata changes
• Con: permanent double-evaluation overhead with no way to graduate past it; loss of validation checks (StrictMetricsEvaluator)

Proposal B: Block UUID Filtering and Comparator Marker
• Block the UUID metric evaluator for data files/manifests that lack a comparator marker
• New manifests/files written post-fix include a marker (uuid-stats-comparator in Avro manifest metadata, or a new DataFile field) indicating their stats use the correct unsigned comparator
• Only marked data files/manifests receive full (single-comparator) pruning
• Pro: converges to full performance eventually; avoids the dual-comparator code complexity (though this seems debatable)
• Con: legacy data gets no UUID pruning until rewritten (unlike Proposal A, which still prunes); since I don't have a proof of concept for this, it's currently more of a theoretical idea

Maybe it's possible to mix and match parts of the proposals if that's reasonable.

The document details StrictMetricsEvaluator's three use cases with SQL examples, as this is currently the only weak point of the PR [3]. The document was converted from Markdown to DOCX, so apologies if its formatting is odd.

I'd appreciate feedback on the proposals.

[1] https://docs.google.com/document/d/1pj2wIDJyV1-9NrSldlg8JUrBSRf7LNDBybIFiGzgAhQ/edit?tab=t.jiqhmlg9liug
[2] https://github.com/apache/iceberg/issues/14216
[3] https://github.com/apache/iceberg/pull/14500

To me, Proposal A seems like a viable approach if we supplement it with some parts of Proposal B in a follow-up PR that addresses UUID expressions being disabled for StrictMetricsEvaluator.
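To make the underlying issue concrete, here is a small self-contained sketch. Note that `unsignedCompare` and `mightMatch` are illustrative helpers I wrote for this email, not code from the PR: they show how `java.util.UUID.compareTo` disagrees with the RFC byte-wise order, and how Proposal A's OR of both comparators would keep a file that either ordering considers in range.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidCompareDemo {

  // RFC 4122/9562 order: compare the 16 UUID bytes as unsigned values.
  // (Illustrative stand-in, not Iceberg's actual comparator.)
  static int unsignedCompare(UUID a, UUID b) {
    byte[] ba = toBytes(a);
    byte[] bb = toBytes(b);
    for (int i = 0; i < 16; i++) {
      int cmp = Integer.compare(ba[i] & 0xFF, bb[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  }

  static byte[] toBytes(UUID u) {
    return ByteBuffer.allocate(16)
        .putLong(u.getMostSignificantBits())
        .putLong(u.getLeastSignificantBits())
        .array();
  }

  // Hypothetical sketch of Proposal A's inclusive check: a file is kept
  // when [lower, upper] contains the value under EITHER comparator.
  static boolean mightMatch(UUID lower, UUID upper, UUID value) {
    boolean signedIn = lower.compareTo(value) <= 0 && value.compareTo(upper) <= 0;
    boolean unsignedIn =
        unsignedCompare(lower, value) <= 0 && unsignedCompare(value, upper) <= 0;
    return signedIn || unsignedIn;
  }

  public static void main(String[] args) {
    // The first byte's high bit is set in `high`, so unsigned order says
    // low < high, but java.util.UUID compares the top 64 bits as a
    // signed long and reports low > high.
    UUID low = UUID.fromString("7fffffff-0000-0000-0000-000000000000");
    UUID high = UUID.fromString("80000000-0000-0000-0000-000000000000");

    System.out.println(low.compareTo(high));         // prints 1  (signed: wrong)
    System.out.println(unsignedCompare(low, high));  // prints -1 (RFC order)
    System.out.println(mightMatch(low, high, high)); // prints true
  }
}
```

Under Proposal A, the cost is exactly that second comparison on every stats check, which is why Proposal B's marker would let newly written files graduate back to a single comparator.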
Cheers,
Vishal Boddu

On Tue, Sep 30, 2025 at 5:41 PM Vishal Boddu <[email protected]> wrote:

> Hello Iceberg Community,
>
> I'm reaching out regarding a correctness issue I've encountered while
> working with Iceberg tables that have UUID-typed columns. In the Java-based
> spec implementation (github.com/apache/iceberg), UUID literals are compared
> using `java.util.UUID.compareTo`, which has a known bug: its comparisons do
> not comply with RFC 4122 and RFC 9562. I opened an issue for this at
> https://github.com/apache/iceberg/issues/14216. Also note that, as far as I
> can tell, the implementations based on Go, C++, Rust, and Python do not
> contain this bug; it affects only the Java-based implementation.
>
> Background:
> The Apache Iceberg specification has supported the UUID data type for
> several years now. According to RFC 4122 and RFC 9562, UUIDs should be
> compared using unsigned byte-wise comparisons. The Apache Parquet
> specification complies with what the RFCs stipulate. However, the Java
> implementation of the Apache Iceberg specification relies on
> java.util.UUID, which performs comparisons using two signed long values, a
> well-known bug that violates the RFC specification.
>
> Correctness issues may be encountered when reading Iceberg tables: during
> ManifestEntry filtering, ManifestEvaluator uses UUID MIN/MAX metrics from
> ManifestEntry to prune partitions and data files based on filter
> expressions. Due to the incorrect comparison semantics, some ManifestEntry
> instances that should NOT be pruned are incorrectly pruned, potentially
> leading to missing data in query results.
>
> While interest in the UUID data type may not be that high (possibly the
> reason this bug did not surface earlier), I think we should consider
> fixing it.
> I am thinking of two potential approaches and would appreciate the
> community's input:
>
> Option 1: Fix the comparison logic directly in the Apache Iceberg Java
> implementation.
> While this provides correct UUID comparison semantics going forward and
> aligns with the RFC specifications and the Parquet implementation, it could
> cause backward-compatibility issues for existing tables and for folks
> relying on the bug. Existing UUID metrics would need migration or
> invalidation (?).
>
> Option 2: Disable UUID filter pushdown to ManifestEvaluator.
> This avoids "different query results" issues with existing tables, and no
> migration is required for existing metrics (but what do we do with those
> MIN/MAX UUID metrics? Leave them dormant?). It brings some performance
> degradation due to the inability to prune partitions and data files based
> on UUID predicates, and it doesn't address the underlying bug.
>
> Questions for the community:
> What does (or should) the Iceberg specification say about how UUID values
> are to be compared? Note that some engines may use Apache Parquet
> (RFC-compliant) to prepare MIN/MAX metrics for ManifestFile.
> Has anyone else encountered this issue, or have thoughts on the best path
> forward?
> For Option 1, what would be the recommended migration strategy for
> existing tables with UUID MIN/MAX metrics?
> Are there other approaches I haven't considered?
> Would the community be open to a fix that includes a table format version
> bump or a feature flag to handle backward compatibility?
>
> I can try to contribute a fix once we align on an approach. Any guidance
> or feedback would be greatly appreciated.
>
> Thank you for your time and consideration.
> Cheers,
> Vishal Boddu

--
Vishal Boddu
Senior Software Engineer Datalake, Dremio
