Re: [DISCUSS] Handling of existing statistics files

dzeri96 Thu, 02 Jul 2026 04:31:53 -0700

Alright, both of you raised some valid points.

Gabor drew my attention to the list of "supported" blob types in the spec, and 
it made me realize that the people who implemented deletion vectors had much 
the same problem we're discussing. Their Puffin files could not share the 
lifecycle that regular statistics files have today, so they referenced them 
from manifest files instead. This is perfectly reasonable, but it takes some 
time to get the mental model right and to separate statistics files from the 
Iceberg Puffin spec. Even so, the distinction stays blurry because, in my 
opinion, the supported blob types are defined at a lower level than they 
should be. DataSketches blobs and deletion vectors are read at two separate 
points of the read process. It's unlikely that anyone is writing a single 
"Iceberg Puffin reader", so blob types should be defined at their respective 
place of access. For example, the deletion vector type should be defined 
together with manifest files.


The reason I'm saying this is not just to reorganize the documentation, but 
to make the case that statistics files should be handled more like large 
properties files with arbitrary information than like a fixed spec that has 
to be interpretable by every reader. And what do we do with custom properties 
currently? We carry them over, no matter the change. I'm not the only person 
who has understood statistics files this way. Dremio, for example, published 
an article 
(https://www.dremio.com/blog/extending-apache-iceberg-best-practices-for-storing-and-discovering-custom-metadata/)
on this, and it coincides with my view. They solved our problem by adding a 
pointer to their latest Puffin file in the table properties. I don't like 
this because it doesn't work with time travel and breaks file cleanup tasks.
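
To make the time-travel and cleanup objections concrete, here is a minimal 
Python sketch. The property key `custom.latest-puffin`, the paths, and the 
function names are invented for illustration; they come neither from the 
Dremio article nor from the Iceberg spec.

```python
# Approach A: one table-wide property points at the newest custom Puffin file.
# The key name is made up for this example.
properties = {"custom.latest-puffin": "stats/snap-2.puffin"}

# Approach B: every snapshot carries its own entry (an assumed, simplified
# stand-in for snapshot-bound statistics entries in table metadata).
statistics_by_snapshot = {1: "stats/snap-1.puffin", 2: "stats/snap-2.puffin"}


def lookup_via_property(snapshot_id):
    # The property is table-wide, so the requested snapshot is ignored.
    return properties["custom.latest-puffin"]


def lookup_via_snapshot(snapshot_id):
    # A snapshot-bound entry follows the snapshot that is being read.
    return statistics_by_snapshot.get(snapshot_id)


def orphan_candidates(files_on_storage, referenced):
    # Files that a cleanup task following only metadata references
    # would treat as unreferenced.
    return sorted(set(files_on_storage) - set(referenced))
```

Reading snapshot 1 through the property returns the file written for 
snapshot 2, while the snapshot-bound lookup returns the matching file. A 
cleanup task that only follows metadata references and does not parse the 
property would also list both Puffin files as orphan candidates.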

Going back to your feedback, I think a realistic way to achieve what I 
suggested is indeed to have multiple Puffin files per snapshot. That way, we 
solve the write amplification and the commit problem (though I believe there 
are still edge cases, just as there are with custom properties). We would 
handle proprietary metadata either by going through the blob types to find a 
supported file, or by adding a new file-level field that records the file's 
spec. One of these specs would be the "Iceberg statistics spec" with 
DataSketches blobs, and everything else is left to the reader to interpret. 
Versions would be handled by the respective spec, just as Iceberg does now.
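
As a rough sketch of the two lookup options, assuming several statistics 
files can be attached to one snapshot: the `spec` field and the 
`iceberg-statistics` identifier below are hypothetical and not part of the 
Iceberg spec today, and the classes are simplified stand-ins for the real 
metadata. Only the blob type `apache-datasketches-theta-v1` is taken from 
the Puffin spec.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BlobMetadata:
    type: str  # e.g. "apache-datasketches-theta-v1"


@dataclass(frozen=True)
class StatisticsFile:
    snapshot_id: int
    statistics_path: str
    blob_metadata: Tuple[BlobMetadata, ...]
    spec: Optional[str] = None  # HYPOTHETICAL file-level field


KNOWN_BLOB_TYPES = frozenset({"apache-datasketches-theta-v1"})
ICEBERG_STATS_SPEC = "iceberg-statistics"  # HYPOTHETICAL identifier


def select_by_blob_type(files, snapshot_id, known=KNOWN_BLOB_TYPES):
    """Option 1: keep files that contain at least one blob type we can read."""
    return [
        f for f in files
        if f.snapshot_id == snapshot_id
        and any(b.type in known for b in f.blob_metadata)
    ]


def select_by_spec(files, snapshot_id, spec=ICEBERG_STATS_SPEC):
    """Option 2: keep files whose declared spec we implement."""
    return [f for f in files if f.snapshot_id == snapshot_id and f.spec == spec]
```

Either way, a reader skips files it cannot interpret instead of failing on 
them. The spec field avoids inspecting every blob entry, at the cost of a 
spec change.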

Overall, I think it's important to reaffirm that my goal is not for every 
system to understand every other one. It's just to prevent systems from 
deleting each other's metadata. An added benefit would be snapshot-specific 
custom properties. I think that's a pretty nice bonus.
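
On the writer side, the behaviour I have in mind could look roughly like 
this. It is only a sketch: `StatsEntry`, its `spec` field, and `attach` are 
hypothetical names, and real table metadata handling is more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatsEntry:
    snapshot_id: int
    path: str
    spec: str  # HYPOTHETICAL file-level field naming the writer's spec


def attach(entries, new_entry):
    """Return a new list in which only the entry with the same snapshot
    and the same spec is replaced; every other entry is carried over."""
    key = (new_entry.snapshot_id, new_entry.spec)
    kept = [e for e in entries if (e.snapshot_id, e.spec) != key]
    return kept + [new_entry]
```

A writer that recomputes the standard statistics for a snapshot replaces 
only its own earlier file and never touches entries it does not own.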

Let me know what you think,
Dzeri

On Wednesday, July 1st, 2026 at 1:01 PM, Gábor Kaszab <[email protected]> 
wrote:

> Hi Dzeri and Tamas,
> Thank you for raising this and sharing your opinion! I'm not entirely sure 
> about the overall conclusions on the proposed way forward here, though. Let 
> me reflect on some of the details:
>
> Create new snapshot when computing statistics
> This is different from the model we use now. A new snapshot is created when 
> data changes within the table. Adding stats leaves the data intact and only 
> adds metadata, so under the current model we don't create a new snapshot. 
> Also, statistics files are attached to a particular snapshot, so it wouldn't 
> be that intuitive to say that in snapshot X we added stats for snapshot Y.
> Also, I'm not entirely sure I understand how this would help with engine_A 
> overwriting stats written by engine_B. Would there be snapshot_X that 
> contains stats for snapshot_Y written by engine_Z? Not something we'd want.
>
> Allow for multiple statistics files to be bound to a snapshot
> With this design, how would we match stat files with engines? Would there be 
> a special string ID that describes the engine? Would there be a different ID 
> for different versions of the engine? Would these IDs be part of the Iceberg 
> spec, or would we rely on each engine knowing its own ID? How would readers 
> know which stat file to read? Would Impala version X.Y know if it can read 
> stat files written by Spark version A.B? I'm not sure there is a good way of 
> implementing this.
> Not convinced this is the way forward.
>
> Enforce carry-over of unknown blob data into new puffin files
> From the reader's perspective this should be fine, they can pick the blobs 
> they understand by blob type.
> From the writer's perspective I'd dispute "Simple to implement". It's 
> simple today, as whenever a writer creates stats for a snapshot, it commits 
> the computed stats for the snapshot unconditionally. With the carry-over 
> approach there are a couple of extra steps and difficulties:
> - If there are existing stats for the snapshot, they have to be read
> - After writing the merged stats to storage (the stats the writer computed 
> together with the blobs the writer doesn't know about) conflict detection has 
> to be performed before commit. Without this, in case some other writer wrote 
> some proprietary stuff we are expected to carry over, we could lose that 
> information. This requires not only conflict checks, but conflict resolution, 
> retries, etc. that complicates a process that is pretty simple today.
>
> But most importantly, I'm very hesitant to introduce support (potentially 
> into the spec too) of proprietary stuff we don't understand.
> - Iceberg is foremost a specification (with a number of reference 
> implementations) that is powerful for cross-engine compatibility. This means, 
> whatever is in the spec is expected to be understood by engines following the 
> spec.
> - Adding proprietary stuff to stats files helps a subset of proprietary 
> engines only. The design allows putting whatever proprietary stuff into 
> Puffin files, but once done, it's up to the proprietary writers to take care 
> of it.
> - Once proprietary stuff has made it into the Puffin files, I don't think 
> the spec should mandate that engines carry it over.
>
> Ways forward:
> 1) Use the proprietary writer to calculate stats
> In this particular case I assume there are 2 writers, one that writes 
> proprietary stuff to Puffin, another that follows the spec and calculates 
> stats into the same Puffin.
> In your description you mention you ran ANALYZE TABLE on your table, but I 
> don't think that it's valid for Iceberg tables. For Iceberg tables, the 
> compute_table_stats procedure is for creating the table-level Puffin files. 
> You either avoid using this through Spark, or you run this first and after 
> this you run your proprietary writer (I think this is what you said you were 
> doing) and then your Puffin is as you expect.
>
> 2) Standardize proprietary stuff
> As I mentioned, Iceberg is powerful for cross-engine compatibility. Your 
> proprietary stuff doesn't help other engines, so there is not much point in 
> adding support to the spec/table format for keeping it. However, I think we 
> can examine what exactly you'd like to store in the Puffin files one by one 
> and then discuss whether the community supports adding them to the spec as 
> officially supported blob types. WDYT?
>
> Best Regards,
> Gabor
>
> Tamás Máté <[email protected]> wrote (on Tue, Jun 30, 2026, 19:41):
>
> > Hi Dzeri,
> >
> > Thanks for writing this up. I agree that the stats-only case is the 
> > important one to separate from data-changing commits, and that replacement 
> > is the tricky part.
> >
> > My mental model for the lifecycle is:
> >
> > 1. An engine analyzes an existing snapshot.
> > 2. It writes a statistics file for that snapshot.
> > 3. It commits a new table metadata version that references the statistics 
> > file, without creating a new snapshot.
> > 4. That metadata entry is carried forward until the snapshot expires, or 
> > until something explicitly replaces or removes it.
> >
> > A writer that does not understand a blob type has no basis to validate it. 
> > Because of that, I think dropping an unknown blob is safer than carrying it 
> > forward into a newly written statistics file, where it may become obsolete 
> > or misleading. I think we should be aggressive about replacing statistics, 
> > and users should be aware that running a stats-producing operation may 
> > replace the statistics for that snapshot and drop statistics written by 
> > another engine.
> >
> > I am also concerned about multiple statistics files per snapshot from a 
> > planning-latency perspective. Spark's current column-statistics planning 
> > path reads NDV from statistics file metadata, not by opening Puffin files 
> > ([SparkScan.estimateStatistics](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L198-L245)).
> >  If multiple files require Puffin reads during planning, that would 
> > introduce a new planning-time I/O path. One thing to consider is whether 
> > statistics loading should move to table load time instead. This is metadata 
> > after all, and I would expect the relevant Puffin metadata to be small, 
> > probably hundreds of KB at most.
> >
> > Overall, I think creating a snapshot when statistics are written is the 
> > simplest model and makes the most sense to me. I would tie statistics to 
> > snapshots so they can be expired with the snapshots they belong to. Also, 
> > because a statistics file can already contain multiple blobs, allowing 
> > multiple statistics files per snapshot feels similar to allowing more 
> > blobs, but with extra file-level lifecycle and planning cost. Problems 
> > could quickly arise if engine A drops stats X and Y, and then engine B, 
> > which expects X, Y, and Z together, later finds only Z.
> >
> > What do you think?
> >
> > Best,
> > Tamas
> >
> > On Thu, 25 Jun 2026 at 12:10, dzeri96 <[email protected]> wrote:
> >
> > >
> > > Hi everyone,
> > >
> > > I've recently started a discussion on Slack and was advised to post in 
> > > the dev mailing list.
> > > As puffin/statistics files are starting to catch on, we are bound to come 
> > > across situations where one writer wants to create a new statistics file 
> > > while some data which it might not understand is already present in the 
> > > current snapshot's statistics file. I've come across this problem in real 
> > > life, when I ran `ANALYZE TABLE` in iceberg-spark, which created a new 
> > > metadata file and replaced my proprietary index data with its own.
> > > You could argue that a single type of writer is expected for a table, but 
> > > on the other hand, the spirit of Iceberg is portability. We can't know 
> > > who's accessing the table and possibly corrupting its (statistics-)data.
> > >
> > > Before I get into the proposed solutions, I think it's important to 
> > > distinguish two scenarios in which statistics files are being written: 
> > > data-changing and non-data-changing.
> > > For data-changing scenarios, I think it's reasonable to assume that old 
> > > statistics files are no longer valid, and are therefore OK to replace. In 
> > > the rest of this email, I will focus on scenarios where statistics are 
> > > being generated and attached to the current snapshot via a new metadata 
> > > file, as these are the problematic ones.
> > >
> > > After a short discussion in Slack, we roughly see three possible 
> > > solutions. I think all of them require a change to the iceberg spec, but 
> > > with varying gravity:
> > >
> > > 1. Enforce carry-over of unknown blob data into new puffin files.
> > > Pros:
> > > - Backwards-compatible reads, not only in terms of the iceberg spec, but 
> > > also in terms of statistics files semantics.
> > > - Simple to implement because blob-level metadata is already available.
> > > - One reader could potentially understand statistics blobs calculated by 
> > > different writers.
> > > Cons:
> > > - Write amplification.
> > > - Conflict resolution might require re-writing the whole file again.
> > >
> > > 2. Allow for multiple statistics files to be bound to a snapshot.
> > > Pros:
> > > - Avoids write amplification.
> > > - Each writer cares only about its own statistics file.
> > > - Finding relevant statistics files is easy thanks to file-level metadata.
> > > - One reader could understand statistics files written by different 
> > > writers.
> > > Cons:
> > > - Backwards-incompatible reads.
> > >
> > > 3. Create new snapshot when computing statistics.
> > > Pros:
> > > - Avoids write amplification.
> > > - Each writer cares only about its own statistics files.
> > > Cons:
> > > - Requires readers to iterate over past snapshots in order to find the 
> > > last valid entry written by a compatible writer.
> > >
> > > I've definitely left some pros and cons out, but you can roughly map 
> > > these cases to ways we handle existing file types (metadata, manifest 
> > > lists, manifests). I'm sure people who have spent time designing the spec 
> > > can more easily list out the possible pitfalls. In my humble opinion, #3 
> > > might be the most straightforward, but #2 is what I initially expected 
> > > from the spec. We are doing #1 internally because it's the only thing we 
> > > can do in the current situation.
> > >
> > > Let me know what you think.
> > > Cheers,
> > > Dzeri
> > >

