Hi all, Following up on the column updates design <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65> and the original discussion thread <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd like to start a focused discussion on how column update files should represent rows when deletion vectors (DVs) are present.
*Context* We've reached consensus on using a dense representation for column update files. When a column is updated, the column file contains values for all rows including unchanged rows. This avoids complex merge logic on the write path when successive updates target overlapping fields. The open question is: what should the column file contain at positions where the base file has deleted rows? There are two options. *Option 1*: Positional Alignment (row count matches base file) The column file has exactly base_file.record_count rows. Row N in the column file corresponds to row N in the base file. Deleted positions contain filler values (e.g., NULLs). Pros*:* - Stitching is a zero-copy column swap in Arrow - Works identically in every Arrow implementation (Java, Rust, Python, C++) - No _pos column needed - Engines apply their existing DV filter to both base and column file Cons*:* - Filler values at deleted positions skew Parquet footer statistics (null_count, avg_length) - Writes slightly more data than necessary (filler values for deleted rows) - Writer must know base_file.record_count to pad trailing deletions (base file metadata already available during write planning) *Option 2*: Applied Deletes (row count = live rows only) The column file contains only live rows (after applying DVs). A _pos column maps each row back to its ordinal position in the base file. Pros*:* - Only stores valid rows in column update files. - Parquet footer statistics are accurate (no skew from NULLs at deleted positions) - Slightly smaller file (no filler bytes) Cons*:* - _pos adds storage overhead (Encoding must be left to the file format) - Stitching requires a scatter operation to allocate a new array and place values at the correct positions - It's not zero-copy in Arrow and requires manipulation. - As it stands today this might be harder for non-Java engines (see section below) I investigated how three Iceberg implementations handle vectorized reading and what column stitching would require in each. The key architectural difference is how they expose Arrow memory: * Java/Spark**:* Spark's ColumnVector is an abstract class. We can override getInt(rowId)to redirect reads without copying data. This makes scatter operations appear "free" via virtual dispatch. My POC uses this approach. *PyIceberg:* Uses PyArrow's native arrays. I could not find a way to override what array[i] returns. PyArrow has take() (gather) but lacks a scatter() primitive (in the version we use). *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not trait objects). Int32Array::value(i) is a direct memory offset. Must materialize new arrays via ArrayBuilder for any non-trivial column manipulation. TL;DR: If we choose Option 2 (applied deletes), engines need a scatter method to stitch column files. I found the following implementations in Arrow which can be used to stitch. - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow 20.0.0) - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow 23.0.0) - I did not find scatter in arrow-rs. I'm still researching these options and would love to hear from everyone. Thanks, Anurag
