Hi all,

Following up on the column updates design
<https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65>
and
the original discussion thread
<https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd
like to start a focused discussion on how column update files should
represent rows when deletion vectors (DVs) are present.

*Context*

We've reached consensus on using a dense representation for column update
files. When a column is updated, the column file contains values for all
rows including unchanged rows. This avoids complex merge logic on the write
path when successive updates target overlapping fields.

The open question is: what should the column file contain at positions
where the base file has deleted rows? There are two options.

*Option 1*: Positional Alignment (row count matches base file)

The column file has exactly base_file.record_count rows. Row N in the
column file corresponds to row N in the base file. Deleted positions
contain filler values (e.g., NULLs).

Pros*:*

   - Stitching is a zero-copy column swap in Arrow
   - Works identically in every Arrow implementation (Java, Rust, Python,
   C++)
   - No _pos column needed
   - Engines apply their existing DV filter to both base and column file

Cons*:*

   - Filler values at deleted positions skew Parquet footer statistics
   (null_count, avg_length)
   - Writes slightly more data than necessary (filler values for deleted
   rows)
   - Writer must know base_file.record_count to pad trailing deletions
   (base file metadata already available during write planning)

*Option 2*: Applied Deletes (row count = live rows only)

The column file contains only live rows (after applying DVs). A _pos column
maps each row back to its ordinal position in the base file.

Pros*:*

   - Only stores valid rows in column update files.
   - Parquet footer statistics are accurate (no skew from NULLs at deleted
   positions)
   - Slightly smaller file (no filler bytes)

Cons*:*

   - _pos adds storage overhead (Encoding must be left to the file format)
   - Stitching requires a scatter operation to allocate a new array and
   place values at the correct positions
   - It's not zero-copy in Arrow and requires manipulation.
   - As it stands today this might be  harder for non-Java engines (see
   section below)

I investigated how three Iceberg implementations handle vectorized reading
and what column stitching would require in each. The key architectural
difference is how they expose Arrow memory:

* Java/Spark**:* Spark's ColumnVector is an abstract class. We can override
getInt(rowId)to redirect reads without copying data. This makes scatter
operations appear "free" via virtual dispatch. My POC uses this approach.

*PyIceberg:* Uses PyArrow's native arrays. I could not find a way
to override what array[i] returns. PyArrow has take() (gather) but lacks a
scatter() primitive (in the  version we use).

*iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not trait
objects). Int32Array::value(i) is a direct memory offset. Must materialize
new arrays via ArrayBuilder for any non-trivial column manipulation.

TL;DR: If we choose Option 2 (applied deletes), engines need a scatter
method to stitch column files. I found the following implementations in
Arrow which can be used to stitch.


   - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow 20.0.0)

   - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow
   23.0.0)
   - I did not find scatter in arrow-rs.

I'm still researching these options and would love to hear from everyone.

Thanks,
Anurag

Reply via email to