Hi,

> From my read of the spec, which may be overly pedantic, it seems like
> attaching anything other than NDV + an associated compact theta sketch is
> *not* compliant with the spec:
True.

> In the section on Table Statistics
> <http://iceberg.apache.org/spec/#table-statistics> it’s explicit that
> statistics are meant to be informational only, and that readers can ignore
> statistics at will: “Statistics are informational. A reader can choose to
> ignore statistics information. Statistics support is not required to read
> the table correctly.” However, it says earlier that statistics are
> stored in ‘valid puffin files’, which can contain *exactly two* blob types
> <https://iceberg.apache.org/puffin-spec/#blob-types>:
> `apache-datasketches-theta-v1` and `deletion-vector-v1`.

Yes. Statistics are optional. But a statistics blob cannot be a
`deletion-vector-v1`. It is the other way around: Puffin files can be used to
store statistics as well as indexes and delete files (like deletion vectors).
When we enhanced the Puffin spec to include `deletion-vector-v1`, we didn't
update the table-statistics section of the spec to clarify that statistics
blobs can only be `apache-datasketches-theta-v1`. Feel free to open a PR to
clarify it.

> Asked explicitly: Is an Iceberg table with a Statistics file containing a
> blob type other than `apache-datasketches-theta-v1` and
> `deletion-vector-v1` a valid Iceberg table? Should engines ignore
> unrecognized blob types in blob metadata structs and associated statistics
> files?

Yes. Engines should ignore it, as stats are optional. Currently the Spark
integration
<https://github.com/apache/iceberg/blob/772c8275598e43d2c5ef029bfe83aeaa6c713e8a/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L226>
ignores unrecognized blob types.

Lastly, I sense that you have proprietary stats stored in Puffin files as a
new blob type and are figuring out how interoperability works if other
engines cannot understand it. I recommend discussing with the community,
contributing those stats to the Iceberg Puffin spec, and standardizing them
to support interoperability with other engines.
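To make the "ignore, don't fail" behavior concrete, here is a minimal,
self-contained Java sketch of how an engine might filter blob metadata down to
the types it recognizes. The `BlobMetadata` record and `StatsFilter` class
here are hypothetical stand-ins, not the real Iceberg API (the actual
interface, `org.apache.iceberg.BlobMetadata`, similarly exposes the blob type
as a string):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical minimal model of a Puffin blob metadata entry; the real
// Iceberg class is org.apache.iceberg.BlobMetadata.
record BlobMetadata(String type, List<Integer> fields) {}

public class StatsFilter {
  // The only blob type the table-statistics section currently anticipates.
  static final String THETA = "apache-datasketches-theta-v1";

  // Keep only blobs this engine recognizes. Unknown types are silently
  // skipped rather than treated as an error, because statistics are
  // informational and optional.
  static List<BlobMetadata> recognized(List<BlobMetadata> blobs) {
    return blobs.stream()
        .filter(b -> THETA.equals(b.type()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<BlobMetadata> blobs = List.of(
        new BlobMetadata(THETA, List.of(1)),
        new BlobMetadata("some-proprietary-stat-v1", List.of(2)));
    // Only the theta sketch survives; the proprietary blob is ignored.
    System.out.println(recognized(blobs).size()); // prints 1
  }
}
```

This mirrors in spirit what the linked SparkScan code does: unrecognized blob
types simply don't contribute to the statistics the engine reports.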
- Ajantha

On Tue, Aug 5, 2025 at 8:44 PM Summers, Carl <summ...@amazon.co.uk.invalid>
wrote:

> Hi,
>
> I’m looking to better understand the intent of some of the language around
> table statistics and related puffin file usage. From my read of the spec,
> which may be overly pedantic, it seems like attaching anything other than
> NDV + an associated compact theta sketch is *not* compliant with the spec:
>
> In the section on Table Statistics
> <http://iceberg.apache.org/spec/#table-statistics> it’s explicit that
> statistics are meant to be informational only, and that readers can ignore
> statistics at will: “Statistics are informational. A reader can choose to
> ignore statistics information. Statistics support is not required to read
> the table correctly.” However, it says earlier that statistics are
> stored in ‘valid puffin files’, which can contain *exactly two* blob types
> <https://iceberg.apache.org/puffin-spec/#blob-types>:
> `apache-datasketches-theta-v1` and `deletion-vector-v1`.
>
> I can appreciate that a reasonable engine author, upon encountering an
> unexpected blob type in a Puffin file, would ignore it as statistics are
> purely informational. However, given that puffin files are now both
> informational and critical for correctness (albeit in different contexts),
> I could see another reasonable engine author choosing to fail a query as
> the table isn’t compliant with the spec. “Breaking” a customer’s usage of
> their table is just about the worst thing we can do, so I’d really
> appreciate some community guidance here.
>
> Asked explicitly: Is an Iceberg table with a Statistics file containing a
> blob type other than `apache-datasketches-theta-v1` and
> `deletion-vector-v1` a valid Iceberg table? Should engines ignore
> unrecognized blob types in blob metadata structs and associated statistics
> files?
>
> --Carl