Minor addition to the storage overhead with writing NULLs as filling values: If we assume Parquet V2 and we write the _pos to the update file, with 20% of the rows deleted will totally amortize the storage overhead of writing NULLs. The reason is that with the "positional aligned" updates we have a complete sequence of the _pos that is efficiently compressed with delta encoding, while with the "applied deletes" approach there are holes in the _pos sequence, hence delta encoding is slightly less efficient.
Just some number to visualize with 20% deletion ratio: 1 int col _pos + 1 int Omit deleted rows 9735745 10111232 Null deleted rows 10189522 10205497 Overhead 4,66% 0,93% Gabor Gábor Kaszab <[email protected]> ezt írta (időpont: 2026. máj. 20., Sze, 16:50): > Hey Iceberg Community, > > Thanks Anurag for starting a dedicated thread on this! Just a couple of > thoughts from my side: > > *Storage overhead:* > *TLDR:* Regardless which way we go, storage overhead shouldn't be an > issue. > > *Details:* > I made a couple of measurements recently on the storage size of the update > files. For numbers, see the "Update file size measurements" tab in this > doc > <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing> > . > > 1) Writing _pos column: > > - File formats could hide the overhead for the _pos column, e.g. > Parquet V2 has negligible (<0,5%) overhead > - I recall there was an argument that having _pos even for pos-aligned > updates might help us debug writer issues to see what is missing/icorrect > in the update file > - On the write path we already read and sort by _pos so there should > be no overhead either > > In summary, I slightly lean towards always having _pos in the updates > regardless what representation we choose. > If we can agree on this, the relevant pros and cons around _pos are > obsolete. > > 2) Filling rows/values for deleted rows: > > - I measured noticeable overhead when filling fields for deleted rows > with NULLs. As always the overhead depends on many stuff > - Deleted 5% of rows => writing NULLs has 2-3% storage overhead > - Deleted 20% of rows => writing NULLs has 4-4.6% overhead > - Since the consensus is constantly eliminating deletes, I think this > overhead is acceptable > > *Factors to consider to choose representation* > As described above, the storage size shouldn't be a factor here, I'd not > consider the presence of _pos as pro or con. > > If we scratch the storage size related ones, here is what we have left: > > *1) Accuracy of stats* > By filling missing rows with auxiliary values the stats can go off > > - I'm not that worried about Parquet footer stats, they represent the > file itself, not the logical deletes on top > - If we went for NULLs as filling values, we could set each field in a > deleted row to NULL (not the entire row). As a result the column-level null > count in the table metadata can be off because of the filling values. > - Technically, if we want to, we can correct these stats if we > collected the number of deleted rows and then adjust the null count by this > number > - In general, deletes make stats off anyway regardless of column > updates (probably not the null count but the avg size, though) > > I don't think that this should be a decisive factor when choosing > representation. > > *2) Complexity in general* > a) Read - stitching > > - Positional alignment approach is straightforward for stitching both > in vectorized and row readers > - Stitch before applying deletes > - Applied deletes approach seems more complicated initially > - Scattering of update rows is required > - Might not be supported for all the language implementations ATM. > Thanks Anurag for taking a look! > > b) Write > > - Applied deletes approach is straightforward > - Positional alignment approach has one major complexity: > - when trailing rows are deleted, the writer currently has no > information how many rows to fill > - In the PoC we broadcast a "file path to row count" so that the > writers can now if trailing rows have to be filled with nulls, and with > how > many (comparing to the _pos column) > - In theory we could simply not fill trailing deleted rows, but > then we have a hybrid approach between positional alignment and applied > deletes. Probably, we don't want this complexity in the spec > > *3) Read and write performance* > > - I'm not expecting any difference in write perf > - Read could have a toll on the "applied deletes" approach due to > scattering. *@pvary* might have some more insights here. > > *Summary* > I hope this summarizes all we have discussed from a different angle and > might narrow down the areas to look for. I think the two main questions to > sort out to make a decision are: > 1) Can we find a way to take care of trailing deletes in "positional > aligned" approach (or are we fine not filling trailing deletes) > 2) What is the cost of scattering the update rows in the "applied > deletes" approach > 2/b) Is scattering feasible on all language implementations > > Best Regards, > Gabor > > > Anurag Mantripragada <[email protected]> ezt írta (időpont: > 2026. máj. 20., Sze, 2:37): > >> Hi all, >> >> Following up on the column updates design >> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65> >> and >> the original discussion thread >> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd >> like to start a focused discussion on how column update files should >> represent rows when deletion vectors (DVs) are present. >> >> *Context* >> >> We've reached consensus on using a dense representation for column >> update files. When a column is updated, the column file contains values for >> all rows including unchanged rows. This avoids complex merge logic on the >> write path when successive updates target overlapping fields. >> >> The open question is: what should the column file contain at positions >> where the base file has deleted rows? There are two options. >> >> *Option 1*: Positional Alignment (row count matches base file) >> >> The column file has exactly base_file.record_count rows. Row N in the >> column file corresponds to row N in the base file. Deleted positions >> contain filler values (e.g., NULLs). >> >> Pros*:* >> >> - Stitching is a zero-copy column swap in Arrow >> - Works identically in every Arrow implementation (Java, Rust, >> Python, C++) >> - No _pos column needed >> - Engines apply their existing DV filter to both base and column file >> >> Cons*:* >> >> - Filler values at deleted positions skew Parquet footer statistics >> (null_count, avg_length) >> - Writes slightly more data than necessary (filler values for deleted >> rows) >> - Writer must know base_file.record_count to pad trailing deletions >> (base file metadata already available during write planning) >> >> *Option 2*: Applied Deletes (row count = live rows only) >> >> The column file contains only live rows (after applying DVs). A _pos column >> maps each row back to its ordinal position in the base file. >> >> Pros*:* >> >> - Only stores valid rows in column update files. >> - Parquet footer statistics are accurate (no skew from NULLs at >> deleted positions) >> - Slightly smaller file (no filler bytes) >> >> Cons*:* >> >> - _pos adds storage overhead (Encoding must be left to the file >> format) >> - Stitching requires a scatter operation to allocate a new array and >> place values at the correct positions >> - It's not zero-copy in Arrow and requires manipulation. >> - As it stands today this might be harder for non-Java engines (see >> section below) >> >> I investigated how three Iceberg implementations handle vectorized >> reading and what column stitching would require in each. The key >> architectural difference is how they expose Arrow memory: >> >> * Java/Spark**:* Spark's ColumnVector is an abstract class. We can >> override getInt(rowId)to redirect reads without copying data. This makes >> scatter operations appear "free" via virtual dispatch. My POC uses this >> approach. >> >> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way >> to override what array[i] returns. PyArrow has take() (gather) but lacks >> a scatter() primitive (in the version we use). >> >> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not >> trait objects). Int32Array::value(i) is a direct memory offset. Must >> materialize new arrays via ArrayBuilder for any non-trivial column >> manipulation. >> >> TL;DR: If we choose Option 2 (applied deletes), engines need a scatter >> method to stitch column files. I found the following implementations in >> Arrow which can be used to stitch. >> >> >> - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow >> 20.0.0) >> >> - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow >> 23.0.0) >> - I did not find scatter in arrow-rs. >> >> I'm still researching these options and would love to hear from everyone. >> >> Thanks, >> Anurag >> >>
