Hi Anurag, Thanks for working on this.
Following up on the discussion around Option 1 (positional alignment), I want to see if we can handle equality deletes interact properly with the presence of column update files. This was identified as a problematic area and the original design doc chose not to support equality delete. I think that would be a major limitation. Also just identifying if a particular column participated in equality delete can be already expensive. Same for the opposite. Here I propose options to handle them and let me know if that makes sense. Let's look at the problem first. Consider this sequence: 1. Base file f0 contains columns A, B (sequence number seq0) 2. Equality delete dv0: DELETE WHERE B = b2 (sequence number seq1) 3. Full column update writes f1 for column B with positional alignment — NULLs at deleted positions (sequence number seq2) At read time, we have a problem: - We cannot evaluate dv0's predicate against f1, because f1 has new values for B — the predicate would either miss the NULL placeholders or falsely match new values that happen to equal the delete predicate but were written after the delete. - We cannot skip dv0, as we still need to resolve and filter out deleted rows In short: the equality delete's predicate is meaningless against the new column file, but its positional effect must still be enforced. To solve this, there are two options: Option 1: We evaluate the equality delete on column B old values from f0, applies the filter bitmap when reading column file f1. This gives us the correct result. Pros - Simple mental model. The equality delete always applies to original value set. No big change on write. Cons - We need to scan column B twice: one for the old values, and one for the new values. Option 2: At the column file generation time, since we already need to figure out the deleted rows (to place NULL fillers), we can just materialize the delete positions at the write time by creating a new position delete file (dv1) using the new sequence number (seq2) making the original equality delete become obsolete (seq2 > seq1). Pros - No need to load old column at read time. We just read f1 for B then apply positional delete from dv1, ignoring dv0. Cons - There is nuance regarding how we handle the delete sequence number. And we will have to do the same even for position delete in order to be consistent. Let me know your thoughts. This is a complex problem. And it's important to get it right. On 2026/05/20 00:37:07 Anurag Mantripragada wrote: > Hi all, > > Following up on the column updates design > <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65> > and > the original discussion thread > <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd > like to start a focused discussion on how column update files should > represent rows when deletion vectors (DVs) are present. > > *Context* > > We've reached consensus on using a dense representation for column update > files. When a column is updated, the column file contains values for all > rows including unchanged rows. This avoids complex merge logic on the write > path when successive updates target overlapping fields. > > The open question is: what should the column file contain at positions > where the base file has deleted rows? There are two options. > > *Option 1*: Positional Alignment (row count matches base file) > > The column file has exactly base_file.record_count rows. Row N in the > column file corresponds to row N in the base file. Deleted positions > contain filler values (e.g., NULLs). > > Pros*:* > > - Stitching is a zero-copy column swap in Arrow > - Works identically in every Arrow implementation (Java, Rust, Python, > C++) > - No _pos column needed > - Engines apply their existing DV filter to both base and column file > > Cons*:* > > - Filler values at deleted positions skew Parquet footer statistics > (null_count, avg_length) > - Writes slightly more data than necessary (filler values for deleted > rows) > - Writer must know base_file.record_count to pad trailing deletions > (base file metadata already available during write planning) > > *Option 2*: Applied Deletes (row count = live rows only) > > The column file contains only live rows (after applying DVs). A _pos column > maps each row back to its ordinal position in the base file. > > Pros*:* > > - Only stores valid rows in column update files. > - Parquet footer statistics are accurate (no skew from NULLs at deleted > positions) > - Slightly smaller file (no filler bytes) > > Cons*:* > > - _pos adds storage overhead (Encoding must be left to the file format) > - Stitching requires a scatter operation to allocate a new array and > place values at the correct positions > - It's not zero-copy in Arrow and requires manipulation. > - As it stands today this might be harder for non-Java engines (see > section below) > > I investigated how three Iceberg implementations handle vectorized reading > and what column stitching would require in each. The key architectural > difference is how they expose Arrow memory: > > * Java/Spark**:* Spark's ColumnVector is an abstract class. We can override > getInt(rowId)to redirect reads without copying data. This makes scatter > operations appear "free" via virtual dispatch. My POC uses this approach. > > *PyIceberg:* Uses PyArrow's native arrays. I could not find a way > to override what array[i] returns. PyArrow has take() (gather) but lacks a > scatter() primitive (in the version we use). > > *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not trait > objects). Int32Array::value(i) is a direct memory offset. Must materialize > new arrays via ArrayBuilder for any non-trivial column manipulation. > > TL;DR: If we choose Option 2 (applied deletes), engines need a scatter > method to stitch column files. I found the following implementations in > Arrow which can be used to stitch. > > > - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow 20.0.0) > > - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow > 23.0.0) > - I did not find scatter in arrow-rs. > > I'm still researching these options and would love to hear from everyone. > > Thanks, > Anurag >
