Hi Dzeri,

Thanks for writing this up. I agree that the stats-only case is the
important one to separate from data-changing commits, and that replacement
is the tricky part.

My mental model for the lifecycle is:

1. An engine analyzes an existing snapshot.
2. It writes a statistics file for that snapshot.
3. It commits a new table metadata version that references the statistics
file, without creating a new snapshot.
4. That metadata entry is carried forward until the snapshot expires, or
until something explicitly replaces or removes it.
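
To make that lifecycle concrete, here is a minimal Python sketch of how I
picture it. This is a toy model, not the Iceberg API: `TableMetadata`,
`set_statistics`, and `expire_snapshot` are hypothetical names, and I am
assuming at most one statistics file entry per snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    # Hypothetical toy model of table metadata, not an Iceberg class.
    snapshot_ids: list = field(default_factory=list)
    # Assumption: at most one statistics file path per snapshot id.
    statistics: dict = field(default_factory=dict)
    version: int = 1

    def set_statistics(self, snapshot_id, path):
        # Steps 2-3: commit a new metadata version, no new snapshot.
        if snapshot_id not in self.snapshot_ids:
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.statistics[snapshot_id] = path  # replaces any earlier entry
        self.version += 1

    def expire_snapshot(self, snapshot_id):
        # Step 4: the statistics entry goes away with its snapshot.
        self.snapshot_ids.remove(snapshot_id)
        self.statistics.pop(snapshot_id, None)
        self.version += 1


meta = TableMetadata(snapshot_ids=[100])
meta.set_statistics(100, "stats-a.puffin")
meta.set_statistics(100, "stats-b.puffin")  # replaces stats-a.puffin
print(meta.snapshot_ids, meta.statistics, meta.version)
```

The second `set_statistics` call is the replacement case from your email:
the snapshot list is unchanged, but the earlier entry is gone.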

On carry-over (option 1): a writer that does not understand a blob type has
no basis to validate it. Because of that, I think dropping an unknown blob
is safer than carrying it forward into a newly written statistics file,
where it may become obsolete or misleading. We should be aggressive about
replacing statistics, and users should be aware that running a
stats-producing operation may replace the statistics for that snapshot and
drop statistics written by another engine.
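
As a rough illustration of the two policies, the sketch below picks the
blobs that go into a replacement statistics file. The function name, the
dict shape, and the `vendor-index-v1` blob type are all made up for the
example; keeping known-but-not-recomputed blobs is also my assumption, not
something the spec requires.

```python
def blobs_for_new_file(existing, fresh, known_types, carry_unknown=False):
    """Pick the blobs to write into a replacement statistics file.

    existing, fresh: lists of dicts with a "type" key (hypothetical shape).
    known_types: blob types this writer understands and can validate.
    """
    fresh_types = {b["type"] for b in fresh}
    kept = []
    for blob in existing:
        if blob["type"] in fresh_types:
            continue  # superseded by a freshly computed blob
        if blob["type"] in known_types or carry_unknown:
            kept.append(blob)
        # Otherwise the blob is dropped: the writer cannot validate it.
    return fresh + kept


existing = [
    {"type": "apache-datasketches-theta-v1", "source": "old"},
    {"type": "vendor-index-v1", "source": "other-engine"},  # made-up type
]
fresh = [{"type": "apache-datasketches-theta-v1", "source": "new"}]
known = {"apache-datasketches-theta-v1"}

dropped = blobs_for_new_file(existing, fresh, known)
carried = blobs_for_new_file(existing, fresh, known, carry_unknown=True)
print([b["type"] for b in dropped])
print([b["type"] for b in carried])
```

With the default, the other engine's blob disappears, which is what I am
arguing users should expect. With `carry_unknown=True` it survives, but
nothing has checked that it still describes the data correctly.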

I am also concerned about multiple statistics files per snapshot (option 2)
from a planning-latency perspective. Spark's current column-statistics
planning path reads NDV from statistics file metadata, not by opening
Puffin files
([SparkScan.estimateStatistics](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L198-L245)).
If multiple files require Puffin reads during planning, that would
introduce a new planning-time I/O path. One thing to consider is whether
statistics loading should move to table load time instead. This is metadata
after all, and I would expect the relevant Puffin metadata to be small,
probably hundreds of KB at most.
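
For reference, this is roughly what "reading NDV from statistics file
metadata" means, sketched in Python rather than the actual Spark code. The
JSON field names follow my reading of the table spec, and the `ndv`
property on `apache-datasketches-theta-v1` blobs follows the Puffin spec;
the helper name and the sample values are hypothetical.

```python
def ndv_from_metadata(statistics_entry, field_id):
    """Return the NDV for field_id from blob metadata, or None.

    Only the table-metadata entry is inspected; no Puffin file is opened.
    """
    for blob in statistics_entry.get("blob-metadata", []):
        if (blob.get("type") == "apache-datasketches-theta-v1"
                and blob.get("fields") == [field_id]):
            ndv = blob.get("properties", {}).get("ndv")
            if ndv is not None:
                return int(ndv)  # properties are string-valued
    return None


# Sample entry with made-up values.
entry = {
    "snapshot-id": 100,
    "statistics-path": "s3://bucket/table/metadata/stats-b.puffin",
    "blob-metadata": [
        {
            "type": "apache-datasketches-theta-v1",
            "snapshot-id": 100,
            "sequence-number": 7,
            "fields": [3],
            "properties": {"ndv": "1250"},
        }
    ],
}

print(ndv_from_metadata(entry, 3))
print(ndv_from_metadata(entry, 4))
```

Anything that is not surfaced in `blob-metadata` like this would need a
Puffin read, which is the extra I/O I am worried about.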

Overall, creating a new snapshot when statistics are written (option 3)
seems like the simplest model to me. I would tie statistics to
snapshots so they can be expired with the snapshots they belong to. Also,
because a statistics file can already contain multiple blobs, allowing
multiple statistics files per snapshot feels similar to allowing more
blobs, but with extra file-level lifecycle and planning cost. Problems
could quickly arise if engine A drops stats X and Y, and then engine B,
which expects X, Y, and Z together, later finds only Z.
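
The X, Y, Z case can be sketched as an all-or-nothing check on the reader
side. This assumes engine B really does need the complete set before it can
use any of it, and `usable_blob_types` is a hypothetical helper, not
something an engine exposes today.

```python
def usable_blob_types(available, required):
    """Return the required set if it is fully present, else an empty set."""
    available = set(available)
    required = set(required)
    return required if required <= available else set()


# Engine B expects X, Y and Z together.
before = usable_blob_types({"X", "Y", "Z"}, {"X", "Y", "Z"})
# Engine A has since dropped X and Y; only Z is left.
after = usable_blob_types({"Z"}, {"X", "Y", "Z"})
print(sorted(before), sorted(after))
```

In the second call Z is still there, but it is of no use to engine B on
its own.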

What do you think?

Best,
Tamas

On Thu, 25 Jun 2026 at 12:10, dzeri96 <[email protected]> wrote:

>
> Hi everyone,
>
> I've recently started a discussion on Slack and was advised to post in the
> dev mailing list.
> As puffin/statistics files are starting to catch on, we are bound to come
> across situations where one writer wants to create a new statistics file
> while some data which it might not understand is already present in the
> current snapshot's statistics file. I've come across this problem in real
> life, when I ran `ANALYZE TABLE` in iceberg-spark, which created a new
> metadata file and replaced my proprietary index data with its own.
> You could argue that a single type of writer is expected for a table, but
> on the other hand, the spirit of Iceberg is portability. We can't know
> who's accessing the table and possibly corrupting its (statistics-)data.
>
> Before I get into the proposed solutions, I think it's important to
> distinguish two scenarios in which statistics files are being written:
> data-changing and non-data-changing.
> For data-changing scenarios, I think it's reasonable to assume that old
> statistics files are no longer valid, and are therefore OK to replace. In
> the rest of this email, I will focus on scenarios where statistics are
> being generated and attached to the current snapshot via a new metadata
> file, as these are the problematic ones.
>
> After a short discussion in Slack, we roughly see three possible
> solutions. I think all of them require a change to the iceberg spec, but
> with varying gravity:
>
> 1. Enforce carry-over of unknown blob data into new puffin files.
>    Pros:
>      - Backwards-compatible reads, not only in terms of the iceberg
>        spec, but also in terms of statistics files semantics.
>      - Simple to implement because blob-level metadata is already
>        available.
>      - One reader could potentially understand statistics blobs
>        calculated by different writers.
>    Cons:
>      - Write amplification.
>      - Conflict resolution might require re-writing the whole file
>        again.
>
> 2. Allow for multiple statistics files to be bound to a snapshot.
>    Pros:
>      - Avoids write amplification.
>      - Each writer cares only about its own statistics file.
>      - Finding relevant statistics files is easy thanks to file-level
>        metadata.
>      - One reader could understand statistics files written by
>        different writers.
>    Cons:
>      - Backwards-incompatible reads.
>
> 3. Create new snapshot when computing statistics.
>    Pros:
>      - Avoids write amplification.
>      - Each writer cares only about its own statistics files.
>    Cons:
>      - Requires readers to iterate over past snapshots in order to
>        find last valid entry written by a compatible writer.
>
> I've definitely left some pros and cons out, but you can roughly map these
> cases to ways we handle existing file types (metadata, manifest lists,
> manifests). I'm sure people who have spent time designing the spec can more
> easily list out the possible pitfalls. In my humble opinion, #3 might be
> the most straightforward, but #2 is what I initially expected from the
> spec. We are doing #1 internally because it's the only thing we can do in
> the current situation.
>
> Let me know what you think.
> Cheers,
> Dzeri
>
>
