Hi Dzeri and Tamas,
Thank you for raising this and sharing your opinion! I'm not entirely sure
about the overall conclusions on the proposed way forward here, though. Let
me reflect on some of the details:
*Create new snapshot when computing statistics*
This is different from the model we use now. A new snapshot is created when
data changes within the table. Adding stats leaves the data intact and only
adds metadata, so following that model, we don't create a new snapshot.
Also, statistics files are attached to a particular snapshot, so it wouldn't
be very intuitive to say that in snapshot X we added stats for snapshot Y.
Also, I'm not entirely sure I understand how this would help with engine_A
overwriting stats written by engine_B. Would there be snapshot_X that
contains stats for snapshot_Y written by engine_Z? Not something we'd want.
*Allow for multiple statistics files to be bound to a snapshot*
With this design, how would we match stat files with engines? Would there
be a special string ID that describes the engine? Would there be a
different ID for different versions of the engine? Would these IDs be part
of the Iceberg spec, or would we rely on each engine knowing its own ID?
How would readers know which stat file to read? Would Impala version X.Y
know if it can read stat files written by Spark version A.B? I'm not sure
there is a good way of implementing this.
Not convinced this is the way forward.
*Enforce carry-over of unknown blob data into new puffin files*
From the reader's perspective this should be fine, they can pick the blobs
they understand by blob type.
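As a rough illustration of the reader side, here is a minimal Python sketch.
BlobMeta, KNOWN_TYPES and readable_blobs are names I made up for this
example (they are not part of any Iceberg library), and "acme-index-v1" is
an invented proprietary blob type. It only assumes that each blob's type is
available from the Puffin footer metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobMeta:
    """Subset of the per-blob metadata found in a Puffin footer."""
    type: str         # blob type, e.g. "apache-datasketches-theta-v1"
    snapshot_id: int  # snapshot the blob was computed for
    fields: tuple     # IDs of the columns the blob covers


# Blob types this hypothetical reader knows how to decode.
KNOWN_TYPES = frozenset({"apache-datasketches-theta-v1"})


def readable_blobs(blobs, known_types=KNOWN_TYPES):
    """Return only the blobs whose type the reader understands."""
    return [b for b in blobs if b.type in known_types]


footer = [
    BlobMeta("apache-datasketches-theta-v1", 42, (1,)),
    BlobMeta("acme-index-v1", 42, (1, 2)),  # made-up proprietary type
]
print([b.type for b in readable_blobs(footer)])
```

The reader never has to open or parse the blob it does not recognize; it
simply skips it.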
From the writer's perspective I'd argue with "Simple to implement". It's
simple today, as whenever a writer creates stats for a snapshot, it commits
the computed stats for the snapshot unconditionally. With the carry-over
approach there are a couple of extra steps and difficulties:
- If there are existing stats for the snapshot, they have to be read first.
- After writing the merged stats to storage (the stats the writer
computed together with the blobs the writer doesn't know about), conflict
detection has to be performed before commit. Without this, if some other
writer wrote proprietary blobs we are expected to carry over, we could lose
that information. This requires not only conflict checks but also conflict
resolution, retries, etc., which complicates a process that is pretty
simple today.
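To make the extra writer-side steps concrete, here is a rough Python sketch
of what carry-over could look like. Everything in it is hypothetical:
FakeTable is an in-memory stand-in for table metadata with a per-snapshot
stats version, and Blob, CommitConflict and write_stats_with_carry_over are
names invented for this example, not Iceberg APIs. A real implementation
would also have to rewrite the Puffin file on every retry.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Blob:
    type: str
    payload: bytes


class CommitConflict(Exception):
    """Raised when the stats changed between read and commit."""


@dataclass
class FakeTable:
    """In-memory stand-in for table metadata (not an Iceberg API)."""
    stats: dict = field(default_factory=dict)  # snapshot_id -> (version, blobs)

    def load_stats(self, snapshot_id):
        return self.stats.get(snapshot_id, (0, []))

    def commit_stats(self, snapshot_id, expected_version, blobs):
        version, _ = self.load_stats(snapshot_id)
        if version != expected_version:  # someone else committed in between
            raise CommitConflict(f"expected v{expected_version}, found v{version}")
        self.stats[snapshot_id] = (version + 1, list(blobs))


def write_stats_with_carry_over(table, snapshot_id, new_blobs, known_types,
                                max_retries=3):
    for _ in range(max_retries):
        # Step 1: read whatever stats already exist for the snapshot.
        version, existing = table.load_stats(snapshot_id)
        # Step 2: keep blobs we don't understand, replace the ones we do.
        carried = [b for b in existing if b.type not in known_types]
        merged = carried + list(new_blobs)
        # Step 3: commit only if nobody changed the stats in the meantime.
        try:
            table.commit_stats(snapshot_id, version, merged)
            return merged
        except CommitConflict:
            continue  # re-read, re-merge and try again
    raise CommitConflict(f"gave up after {max_retries} attempts")


table = FakeTable()
table.commit_stats(7, 0, [Blob("acme-index-v1", b"idx")])  # proprietary writer
merged = write_stats_with_carry_over(
    table, 7,
    [Blob("apache-datasketches-theta-v1", b"ndv")],
    known_types={"apache-datasketches-theta-v1"},
)
print([b.type for b in merged])
```

Compare this with today's behaviour, where the writer just commits its own
blobs unconditionally and none of the read, merge or retry logic is needed.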
But most importantly, I'm very hesitant to introduce support (potentially
into the spec too) for proprietary stuff we don't understand.
- Iceberg is first and foremost a specification (with a number of reference
implementations) that is powerful for cross-engine compatibility. This
means that whatever is in the spec is expected to be understood by engines
following the spec.
- Adding proprietary stuff to stats files helps a subset of proprietary
engines only. The design allows putting whatever proprietary stuff into
Puffin files, but once done, it's up to the proprietary writers to take
care of it.
- Once proprietary stuff has made it into the Puffin files, I don't think
the spec should mandate that engines carry it over.
*Ways forward:*
1) Use the proprietary writer to calculate stats
In this particular case I assume there are 2 writers: one that writes
proprietary stuff to Puffin, and another that follows the spec and writes
its computed stats into the same Puffin file.
In your description you mention you ran ANALYZE TABLE on your table, but I
don't think that is valid for Iceberg tables. For Iceberg tables, the
compute_table_stats procedure is what creates the table-level Puffin files.
You can either avoid running it through Spark, or run it first and run your
proprietary writer afterwards (I think this is what you said you were
doing), so that your Puffin file ends up as you expect.
2) Standardize proprietary stuff
As I mentioned, Iceberg is powerful for cross-engine compatibility. Your
proprietary stuff doesn't help other engines, so there is not much point in
adding support to the spec/table format for keeping it. However, I think we
can examine what exactly you'd like to store in the Puffin files one by one
and then discuss whether the community supports adding them to the spec as
officially supported blob types. WDYT?
Best Regards,
Gabor
Tamás Máté <[email protected]> wrote (on Tue, 30 Jun 2026 at 19:41):
> Hi Dzeri,
>
> Thanks for writing this up. I agree that the stats-only case is the
> important one to separate from data-changing commits, and that replacement
> is the tricky part.
>
> My mental model for the lifecycle is:
>
> 1. An engine analyzes an existing snapshot.
> 2. It writes a statistics file for that snapshot.
> 3. It commits a new table metadata version that references the statistics
> file, without creating a new snapshot.
> 4. That metadata entry is carried forward until the snapshot expires, or
> until something explicitly replaces or removes it.
>
> A writer that does not understand a blob type has no basis to validate it.
> Because of that, I think dropping an unknown blob is safer than carrying it
> forward into a newly written statistics file, where it may become obsolete
> or misleading. I think we should be aggressive about replacing statistics,
> and users should be aware that running a stats-producing operation may
> replace the statistics for that snapshot and drop statistics written by
> another engine.
>
> I am also concerned about multiple statistics files per snapshot from a
> planning-latency perspective. Spark's current column-statistics planning
> path reads NDV from statistics file metadata, not by opening Puffin files
> ([SparkScan.estimateStatistics](
> https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L198-L245)).
> If multiple files require Puffin reads during planning, that would
> introduce a new planning-time I/O path. One thing to consider is whether
> statistics loading should move to table load time instead. This is metadata
> after all, and I would expect the relevant Puffin metadata to be small,
> probably hundreds of KB at most.
>
> Overall, I think creating a snapshot when statistics are written is the
> simplest model and makes the most sense to me. I would tie statistics to
> snapshots so they can be expired with the snapshots they belong to. Also,
> because a statistics file can already contain multiple blobs, allowing
> multiple statistics files per snapshot feels similar to allowing more
> blobs, but with extra file-level lifecycle and planning cost. Problems
> could quickly arise if engine A drops stats X and Y, and then engine B,
> which expects X, Y, and Z together, later finds only Z.
>
> What do you think?
>
> Best,
> Tamas
>
> On Thu, 25 Jun 2026 at 12:10, dzeri96 <[email protected]> wrote:
>
>>
>> Hi everyone,
>>
>> I've recently started a discussion on Slack and was advised to post in
>> the dev mailing list.
>> As puffin/statistics files are starting to catch on, we are bound to come
>> across situations where one writer wants to create a new statistics file
>> while some data which it might not understand is already present in the
>> current snapshot's statistics file. I've come across this problem in real
>> life, when I ran `ANALYZE TABLE` in iceberg-spark, which created a new
>> metadata file and replaced my proprietary index data with its own.
>> You could argue that a single type of writer is expected for a table, but
>> on the other hand, the spirit of Iceberg is portability. We can't know
>> who's accessing the table and possibly corrupting its (statistics-)data.
>>
>> Before I get into the proposed solutions, I think it's important to
>> distinguish two scenarios in which statistics files are being written:
>> data-changing and non-data-changing.
>> For data-changing scenarios, I think it's reasonable to assume that old
>> statistics files are no longer valid, and are therefore OK to replace. In
>> the rest of this email, I will focus on scenarios where statistics are
>> being generated and attached to the current snapshot via a new metadata
>> file, as these are the problematic ones.
>>
>> After a short discussion in Slack, we roughly see three possible
>> solutions. I think all of them require a change to the iceberg spec, but
>> with varying gravity:
>>
>> 1. Enforce carry-over of unknown blob data into new puffin files.
>> Pros:
>> - Backwards-compatible reads, not only in terms of the
>> iceberg spec, but also in terms of statistics files semantics.
>> - Simple to implement because blob-level metadata is already
>> available.
>> - One reader could potentially understand statistics
>> blobs calculated by different writers.
>> Cons:
>> - Write amplification.
>> - Conflict resolution might require re-writing the whole file
>> again.
>>
>> 2. Allow for multiple statistics files to be bound to a snapshot.
>> Pros:
>> - Avoids write amplification.
>> - Each writer cares only about its own statistics file.
>> - Finding relevant statistics files is easy thanks to
>> file-level metadata.
>> - One reader could understand statistics files written by
>> different writers.
>> Cons:
>> - Backwards-incompatible reads.
>>
>> 3. Create new snapshot when computing statistics.
>> Pros:
>> - Avoids write amplification.
>> - Each writer cares only about its own statistics files.
>> Cons:
>> - Requires readers to iterate over past snapshots in order to
>> find last valid entry written by a compatible writer.
>>
>> I've definitely left some pros and cons out, but you can roughly map
>> these cases to ways we handle existing file types (metadata, manifest
>> lists, manifests). I'm sure people who have spent time designing the spec
>> can more easily list out the possible pitfalls. In my humble opinion, #3
>> might be the most straightforward, but #2 is what I initially expected from
>> the spec. We are doing #1 internally because it's the only thing we can do
>> in the current situation.
>>
>> Let me know what you think.
>> Cheers,
>> Dzeri
>>
>>