Hey Iceberg Community, Thanks Anurag for starting a dedicated thread on this! Just a couple of thoughts from my side:
*Storage overhead:* *TLDR:* Regardless which way we go, storage overhead shouldn't be an issue. *Details:* I made a couple of measurements recently on the storage size of the update files. For numbers, see the "Update file size measurements" tab in this doc <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing> . 1) Writing _pos column: - File formats could hide the overhead for the _pos column, e.g. Parquet V2 has negligible (<0,5%) overhead - I recall there was an argument that having _pos even for pos-aligned updates might help us debug writer issues to see what is missing/icorrect in the update file - On the write path we already read and sort by _pos so there should be no overhead either In summary, I slightly lean towards always having _pos in the updates regardless what representation we choose. If we can agree on this, the relevant pros and cons around _pos are obsolete. 2) Filling rows/values for deleted rows: - I measured noticeable overhead when filling fields for deleted rows with NULLs. As always the overhead depends on many stuff - Deleted 5% of rows => writing NULLs has 2-3% storage overhead - Deleted 20% of rows => writing NULLs has 4-4.6% overhead - Since the consensus is constantly eliminating deletes, I think this overhead is acceptable *Factors to consider to choose representation* As described above, the storage size shouldn't be a factor here, I'd not consider the presence of _pos as pro or con. If we scratch the storage size related ones, here is what we have left: *1) Accuracy of stats* By filling missing rows with auxiliary values the stats can go off - I'm not that worried about Parquet footer stats, they represent the file itself, not the logical deletes on top - If we went for NULLs as filling values, we could set each field in a deleted row to NULL (not the entire row). As a result the column-level null count in the table metadata can be off because of the filling values. - Technically, if we want to, we can correct these stats if we collected the number of deleted rows and then adjust the null count by this number - In general, deletes make stats off anyway regardless of column updates (probably not the null count but the avg size, though) I don't think that this should be a decisive factor when choosing representation. *2) Complexity in general* a) Read - stitching - Positional alignment approach is straightforward for stitching both in vectorized and row readers - Stitch before applying deletes - Applied deletes approach seems more complicated initially - Scattering of update rows is required - Might not be supported for all the language implementations ATM. Thanks Anurag for taking a look! b) Write - Applied deletes approach is straightforward - Positional alignment approach has one major complexity: - when trailing rows are deleted, the writer currently has no information how many rows to fill - In the PoC we broadcast a "file path to row count" so that the writers can now if trailing rows have to be filled with nulls, and with how many (comparing to the _pos column) - In theory we could simply not fill trailing deleted rows, but then we have a hybrid approach between positional alignment and applied deletes. Probably, we don't want this complexity in the spec *3) Read and write performance* - I'm not expecting any difference in write perf - Read could have a toll on the "applied deletes" approach due to scattering. *@pvary* might have some more insights here. *Summary* I hope this summarizes all we have discussed from a different angle and might narrow down the areas to look for. I think the two main questions to sort out to make a decision are: 1) Can we find a way to take care of trailing deletes in "positional aligned" approach (or are we fine not filling trailing deletes) 2) What is the cost of scattering the update rows in the "applied deletes" approach 2/b) Is scattering feasible on all language implementations Best Regards, Gabor Anurag Mantripragada <[email protected]> ezt írta (időpont: 2026. máj. 20., Sze, 2:37): > Hi all, > > Following up on the column updates design > <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65> > and > the original discussion thread > <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd > like to start a focused discussion on how column update files should > represent rows when deletion vectors (DVs) are present. > > *Context* > > We've reached consensus on using a dense representation for column update > files. When a column is updated, the column file contains values for all > rows including unchanged rows. This avoids complex merge logic on the write > path when successive updates target overlapping fields. > > The open question is: what should the column file contain at positions > where the base file has deleted rows? There are two options. > > *Option 1*: Positional Alignment (row count matches base file) > > The column file has exactly base_file.record_count rows. Row N in the > column file corresponds to row N in the base file. Deleted positions > contain filler values (e.g., NULLs). > > Pros*:* > > - Stitching is a zero-copy column swap in Arrow > - Works identically in every Arrow implementation (Java, Rust, Python, > C++) > - No _pos column needed > - Engines apply their existing DV filter to both base and column file > > Cons*:* > > - Filler values at deleted positions skew Parquet footer statistics > (null_count, avg_length) > - Writes slightly more data than necessary (filler values for deleted > rows) > - Writer must know base_file.record_count to pad trailing deletions > (base file metadata already available during write planning) > > *Option 2*: Applied Deletes (row count = live rows only) > > The column file contains only live rows (after applying DVs). A _pos column > maps each row back to its ordinal position in the base file. > > Pros*:* > > - Only stores valid rows in column update files. > - Parquet footer statistics are accurate (no skew from NULLs at > deleted positions) > - Slightly smaller file (no filler bytes) > > Cons*:* > > - _pos adds storage overhead (Encoding must be left to the file format) > - Stitching requires a scatter operation to allocate a new array and > place values at the correct positions > - It's not zero-copy in Arrow and requires manipulation. > - As it stands today this might be harder for non-Java engines (see > section below) > > I investigated how three Iceberg implementations handle vectorized reading > and what column stitching would require in each. The key architectural > difference is how they expose Arrow memory: > > * Java/Spark**:* Spark's ColumnVector is an abstract class. We can > override getInt(rowId)to redirect reads without copying data. This makes > scatter operations appear "free" via virtual dispatch. My POC uses this > approach. > > *PyIceberg:* Uses PyArrow's native arrays. I could not find a way > to override what array[i] returns. PyArrow has take() (gather) but lacks a > scatter() primitive (in the version we use). > > *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not > trait objects). Int32Array::value(i) is a direct memory offset. Must > materialize new arrays via ArrayBuilder for any non-trivial column > manipulation. > > TL;DR: If we choose Option 2 (applied deletes), engines need a scatter > method to stitch column files. I found the following implementations in > Arrow which can be used to stitch. > > > - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow > 20.0.0) > > - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow > 23.0.0) > - I did not find scatter in arrow-rs. > > I'm still researching these options and would love to hear from everyone. > > Thanks, > Anurag > >
