Hi,

> From my read of the spec, which may be overly pedantic, it seems like
> attaching anything other than NDV + an associated compact theta sketch is
> *not* compliant with the spec:


True.

> In the section on Table Statistics
> <http://iceberg.apache.org/spec/#table-statistics> it’s explicit that
> statistics are meant to be informational only, and that readers can ignore
> statistics at will: “Statistics are informational. A reader can choose to
> ignore statistics information. Statistics support is not required to read
> the table correctly.”  However, it says earlier that statistics are
> stored in “valid puffin files”, which can contain *exactly two* blob types
> <https://iceberg.apache.org/puffin-spec/#blob-types>:
> `apache-datasketches-theta-v1` and `deletion-vector-v1`.


Yes. Statistics are optional.
But statistics cannot be stored as `deletion-vector-v1` blobs; it is the
other way around. Puffin files can be used to store statistics as well as
indexes and delete files (like deletion vectors). When we enhanced the
Puffin spec to include `deletion-vector-v1`, we didn't update the table
statistics spec to clarify that statistics blobs can only be
`apache-datasketches-theta-v1`. Feel free to open a PR to clarify it.

> Asked explicitly: Is an Iceberg table with a Statistics file containing a
> blob type other than `apache-datasketches-theta-v1` and
> `deletion-vector-v1` a valid Iceberg table?  Should engines ignore
> unrecognized blob types in blob metadata structs and associated statistics
> file?


Yes. Engines should ignore unrecognized blob types, as stats are optional.
Currently the Spark integration
<https://github.com/apache/iceberg/blob/772c8275598e43d2c5ef029bfe83aeaa6c713e8a/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L226>
ignores them.
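The ignore-unknown-types pattern boils down to filtering blob metadata by the type strings an engine understands. Here is a minimal, self-contained sketch in Java; `BlobMetadata` below is a hypothetical stand-in record, not Iceberg's actual `org.apache.iceberg.BlobMetadata` class, and the method name is illustrative:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for Iceberg's blob metadata struct: just the
// fields needed to show the filtering idea (type string + snapshot id).
record BlobMetadata(String type, long snapshotId) {}

public class IgnoreUnknownBlobs {
    static final String THETA = "apache-datasketches-theta-v1";

    // Keep only blob types this engine understands and silently skip the
    // rest, since table statistics are informational per the spec.
    static List<BlobMetadata> usableBlobs(List<BlobMetadata> all) {
        return all.stream()
                .filter(b -> THETA.equals(b.type()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<BlobMetadata> blobs = List.of(
                new BlobMetadata(THETA, 1L),
                new BlobMetadata("some-proprietary-stats-v1", 1L));
        // Only the theta sketch blob survives the filter.
        System.out.println(usableBlobs(blobs).size()); // prints 1
    }
}
```

Failing the query on an unknown type would be the opposite design choice; the spec's "a reader can choose to ignore statistics" language is what justifies skipping rather than failing.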

Lastly, I sense that you have proprietary stats stored in Puffin files as a
new blob type and are figuring out how interoperability works if other
engines cannot understand it. I recommend discussing this with the
community and contributing those stats to the Iceberg Puffin spec,
standardizing them to support interoperability with other engines.

- Ajantha

On Tue, Aug 5, 2025 at 8:44 PM Summers, Carl <summ...@amazon.co.uk.invalid>
wrote:

> Hi,
>
>
>
> I’m looking to better understand the intent of some of the language around
> table statistics and related puffin file usage.  From my read of the spec,
> which may be overly pedantic, it seems like attaching anything other than
> NDV + an associated compact theta sketch is *not* compliant with the spec:
>
>
>
> In the section on Table Statistics
> <http://iceberg.apache.org/spec/#table-statistics> it’s explicit that
> statistics are meant to be informational only, and that readers can ignore
> statistics at will: “Statistics are informational. A reader can choose to
> ignore statistics information. Statistics support is not required to read
> the table correctly.”  However, it says earlier that statistics are
> stored in “valid puffin files”, which can contain *exactly two* blob types
> <https://iceberg.apache.org/puffin-spec/#blob-types>:
> `apache-datasketches-theta-v1` and `deletion-vector-v1`.
>
>
>
> I can appreciate that a reasonable engine author, upon encountering an
> unexpected blob type in a Puffin file, would ignore it as statistics are
> purely informational.  However, given that puffin files are now both
> informational and critical for correctness (albeit in different contexts),
> I could see another reasonable engine author choosing to fail a query as
> the table isn’t compliant to the spec.  “Breaking” a customer’s usage of
> their table is just about the worst thing we can do, so I’d really
> appreciate some community guidance here.
>
>
>
> Asked explicitly: Is an Iceberg table with a Statistics file containing a
> blob type other than `apache-datasketches-theta-v1` and
> `deletion-vector-v1` a valid Iceberg table?  Should engines ignore
> unrecognized blob types in blob metadata structs and associated statistics
> file?
>
>
>
> --Carl
>