Hi Dzeri,

Thanks for writing this up. I agree that the stats-only case is the important one to separate from data-changing commits, and that replacement is the tricky part.

My mental model for the lifecycle is:

1. An engine analyzes an existing snapshot.
2. It writes a statistics file for that snapshot.
3. It commits a new table metadata version that references the statistics file, without creating a new snapshot.
4. That metadata entry is carried forward until the snapshot expires, or until something explicitly replaces or removes it.

A writer that does not understand a blob type has no basis to validate it. Because of that, I think dropping an unknown blob is safer than carrying it forward into a newly written statistics file, where it may become obsolete or misleading. I think we should be aggressive about replacing statistics, and users should be aware that running a stats-producing operation may replace the statistics for that snapshot and drop statistics written by another engine.

I am also concerned about multiple statistics files per snapshot from a planning-latency perspective. Spark's current column-statistics planning path reads NDV from statistics file metadata, not by opening Puffin files ([SparkScan.estimateStatistics](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L198-L245)). If multiple files require Puffin reads during planning, that would introduce a new planning-time I/O path. One thing to consider is whether statistics loading should move to table load time instead. This is metadata after all, and I would expect the relevant Puffin metadata to be small, probably hundreds of KB at most.

Overall, I think creating a snapshot when statistics are written is the simplest model and makes the most sense to me. I would tie statistics to snapshots so they can be expired with the snapshots they belong to. Also, because a statistics file can already contain multiple blobs, allowing multiple statistics files per snapshot feels similar to allowing more blobs, but with extra file-level lifecycle and planning cost.

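To make the lifecycle and the replacement behaviour above concrete, here is a minimal Python sketch. It is a toy in-memory model, not the Iceberg API: `Blob`, `StatisticsFile`, `TableMetadata`, `replace_statistics`, and `expire_snapshot` are hypothetical names, and `acme-index-v1` stands in for some proprietary blob type. It also assumes that a writer keeps existing blobs of types it understands unless it recomputed them, which is only one possible policy.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Blob:
    blob_type: str              # e.g. "apache-datasketches-theta-v1"
    field_ids: Tuple[int, ...]  # columns the blob was computed over


@dataclass(frozen=True)
class StatisticsFile:
    snapshot_id: int
    path: str
    blobs: Tuple[Blob, ...]


@dataclass(frozen=True)
class TableMetadata:
    version: int
    snapshot_ids: Tuple[int, ...]
    # At most one statistics file per snapshot in this model.
    statistics: Dict[int, StatisticsFile] = field(default_factory=dict)


def replace_statistics(
    meta: TableMetadata,
    snapshot_id: int,
    path: str,
    new_blobs: Iterable[Blob],
    understood_types: FrozenSet[str],
) -> Tuple[TableMetadata, Tuple[Blob, ...]]:
    """Commit a new metadata version whose statistics entry for
    `snapshot_id` points at a newly written file. No snapshot is added.
    Returns the new metadata and the blobs dropped as unknown."""
    if snapshot_id not in meta.snapshot_ids:
        raise ValueError(f"unknown snapshot {snapshot_id}")
    new_blobs = tuple(new_blobs)
    old = meta.statistics.get(snapshot_id)
    old_blobs = old.blobs if old else ()
    recomputed = {(b.blob_type, b.field_ids) for b in new_blobs}
    # Assumption: understood blobs that were not recomputed are kept.
    kept = tuple(
        b for b in old_blobs
        if b.blob_type in understood_types
        and (b.blob_type, b.field_ids) not in recomputed
    )
    # Blobs the writer cannot validate are dropped, not carried forward.
    dropped = tuple(b for b in old_blobs if b.blob_type not in understood_types)
    stats = dict(meta.statistics)
    stats[snapshot_id] = StatisticsFile(snapshot_id, path, new_blobs + kept)
    return replace(meta, version=meta.version + 1, statistics=stats), dropped


def expire_snapshot(meta: TableMetadata, snapshot_id: int) -> TableMetadata:
    """Expiring a snapshot also removes the statistics entry tied to it."""
    ids = tuple(s for s in meta.snapshot_ids if s != snapshot_id)
    stats = {s: f for s, f in meta.statistics.items() if s != snapshot_id}
    return replace(meta, version=meta.version + 1, snapshot_ids=ids, statistics=stats)


THETA = "apache-datasketches-theta-v1"
ACME = "acme-index-v1"  # hypothetical proprietary blob type

meta_v1 = TableMetadata(version=1, snapshot_ids=(100,))
# Engine A attaches its proprietary blob to snapshot 100.
meta_v2, _ = replace_statistics(
    meta_v1, 100, "stats-a.puffin", [Blob(ACME, (1,))], frozenset({ACME})
)
# Engine B only understands theta sketches, so A's blob is dropped.
meta_v3, dropped = replace_statistics(
    meta_v2, 100, "stats-b.puffin", [Blob(THETA, (1, 2))], frozenset({THETA})
)
# Expiring the snapshot removes its statistics entry as well.
meta_v4 = expire_snapshot(meta_v3, 100)
```

In this model the second writer silently removes the first writer's blob, which is the behaviour I think users would need to be made aware of.
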
Problems could quickly arise if engine A drops stats X and Y, and then engine B, which expects X, Y, and Z together, later finds only Z.

What do you think?

Best,
Tamas

On Thu, 25 Jun 2026 at 12:10, dzeri96 <[email protected]> wrote:
>
> Hi everyone,
>
> I've recently started a discussion on Slack and was advised to post in the
> dev mailing list.
> As puffin/statistics files are starting to catch on, we are bound to come
> across situations where one writer wants to create a new statistics file
> while some data which it might not understand is already present in the
> current snapshot's statistics file. I've come across this problem in real
> life, when I ran `ANALYZE TABLE` in iceberg-spark, which created a new
> metadata file and replaced my proprietary index data with its own.
> You could argue that a single type of writer is expected for a table, but
> on the other hand, the spirit of Iceberg is portability. We can't know
> who's accessing the table and possibly corrupting its (statistics-)data.
>
> Before I get into the proposed solutions, I think it's important to
> distinguish two scenarios in which statistics files are being written:
> data-changing and non-data-changing.
> For data-changing scenarios, I think it's reasonable to assume that old
> statistics files are no longer valid, and are therefore OK to replace. In
> the rest of this email, I will focus on scenarios where statistics are
> being generated and attached to the current snapshot via a new metadata
> file, as these are the problematic ones.
>
> After a short discussion in Slack, we roughly see three possible
> solutions. I think all of them require a change to the iceberg spec, but
> with varying gravity:
>
> 1. Enforce carry-over of unknown blob data into new puffin files.
>    Pros:
>      - Backwards-compatible reads, not only in terms of the
>        iceberg spec, but also in terms of statistics files semantics.
>      - Simple to implement because blob-level metadata is already
>        available.
>      - One reader could potentially understand statistics
>        blobs calculated by different writers.
>    Cons:
>      - Write amplification.
>      - Conflict resolution might require re-writing the whole file
>        again.
>
> 2. Allow for multiple statistics files to be bound to a snapshot.
>    Pros:
>      - Avoids write amplification.
>      - Each writer cares only about its own statistics file.
>      - Finding relevant statistics files is easy thanks to
>        file-level metadata.
>      - One reader could understand statistics files written by
>        different writers.
>    Cons:
>      - Backwards-incompatible reads.
>
> 3. Create new snapshot when computing statistics.
>    Pros:
>      - Avoids write amplification.
>      - Each writer cares only about its own statistics files.
>    Cons:
>      - Requires readers to iterate over past snapshots in order to
>        find last valid entry written by a compatible writer.
>
> I've definitely left some pros and cons out, but you can roughly map these
> cases to ways we handle existing file types (metadata, manifest lists,
> manifests). I'm sure people who have spent time designing the spec can more
> easily list out the possible pitfalls. In my humble opinion, #3 might be
> the most straightforward, but #2 is what I initially expected from the
> spec. We are doing #1 internally because it's the only thing we can do in
> the current situation.
>
> Let me know what you think.
> Cheers,
> Dzeri
