Re: [Discuss] Column Update File Representation

Xiening Dai Wed, 27 May 2026 23:48:25 -0700

Hi Anurag, 

Thanks for working on this.

Following up on the discussion around Option 1 (positional alignment), I want 
to see if we can make equality deletes interact properly with column update 
files. This was identified as a problematic area, and the original design doc 
chose not to support equality deletes. I think that would be a major 
limitation. Also, just identifying whether a particular column participates in 
an equality delete can already be expensive. The same is true in the opposite 
direction.

Here I propose options to handle them; let me know if they make sense.

Let's look at the problem first. Consider this sequence:

  1. Base file f0 contains columns A, B (sequence number seq0)
  2. Equality delete dv0: DELETE WHERE B = b2 (sequence number seq1)
  3. Full column update writes f1 for column B with positional alignment, 
     with NULLs at deleted positions (sequence number seq2)

At read time, we have a problem:

  - We cannot evaluate dv0's predicate against f1, because f1 has new values 
    for B. The predicate would either miss the NULL placeholders or falsely 
    match new values that happen to equal the delete value but were written 
    after the delete.
  - We cannot skip dv0 either, as we still need to resolve and filter out the 
    deleted rows.

In short: the equality delete's predicate is meaningless against the new column 
file, but its positional effect must still be enforced.
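
To make the problem concrete, here is a minimal Python sketch of the sequence 
above. It is a toy model, not Iceberg code: files are plain lists indexed by 
row position, and the values "b1", "b3", "b1_new" and the helper 
live_positions() are made up for illustration.

```python
# Toy model: each "file" is a list of values for column B, indexed by row position.
f0_B = ["b1", "b2", "b3"]   # base file f0 (seq0)
delete_value = "b2"         # equality delete dv0: DELETE WHERE B = b2 (seq1)

# Column update f1 (seq2), positionally aligned with f0.
# Position 1 was already deleted, so it holds a NULL filler.
# The update happens to write the value "b2" at position 2.
f1_B = ["b1_new", None, "b2"]


def live_positions(values, delete_value):
    """Positions that survive when the delete predicate is evaluated on `values`."""
    return [pos for pos, v in enumerate(values) if v != delete_value]


# Evaluating against the old values removes exactly the row that was deleted.
correct = live_positions(f0_B, delete_value)

# Evaluating against the new values misses the NULL filler at position 1
# (the deleted row comes back) and falsely drops position 2.
wrong = live_positions(f1_B, delete_value)

print("evaluated on f0:", correct)
print("evaluated on f1:", wrong)
```

The two position lists differ in both directions, which is the pair of 
failure modes described above.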

To solve this, there are two options:

Option 1:

We evaluate the equality delete against the old values of column B from f0, 
then apply the resulting filter bitmap when reading column file f1. This gives 
us the correct result.

Pros - Simple mental model. The equality delete always applies to the original 
value set. No big change on the write path.
Cons - We need to scan column B twice: once for the old values and once for 
the new values.
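
A rough sketch of the Option 1 read path, under the same toy assumptions 
(lists indexed by position; read_option1() and the sample values are 
hypothetical, not an existing API):

```python
def read_option1(f0_A, f0_B, f1_B, delete_values):
    """Read rows (A, B) where B comes from the column update file f1.

    The equality delete is evaluated on the OLD values of B from f0, and the
    resulting bitmap is applied while reading the NEW values from f1.
    """
    # Pass 1: scan old B values to build the delete bitmap.
    deleted = [old_b in delete_values for old_b in f0_B]
    # Pass 2: scan new B values, skipping positions marked in the bitmap.
    return [
        (a, new_b)
        for a, new_b, is_deleted in zip(f0_A, f1_B, deleted)
        if not is_deleted
    ]


f0_A = ["a1", "a2", "a3"]
f0_B = ["b1", "b2", "b3"]        # old values (seq0)
f1_B = ["b1_new", None, "b2"]    # new values with a NULL filler (seq2)

rows = read_option1(f0_A, f0_B, f1_B, delete_values={"b2"})
print(rows)
```

The row deleted at seq1 stays deleted, and the new "b2" written at seq2 
survives, at the cost of reading column B from both f0 and f1.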

Option 2:

At column file generation time, we already need to figure out the deleted 
rows (to place NULL fillers), so we can materialize the delete positions at 
write time by creating a new position delete file (dv1) with the new sequence 
number (seq2). This makes the original equality delete obsolete (seq2 > seq1).

Pros - No need to load the old column at read time. We just read f1 for B, 
then apply the positional deletes from dv1, ignoring dv0.
Cons - There is nuance in how we handle the delete sequence number. We would 
also have to do the same for position deletes in order to be consistent.
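
A rough sketch of Option 2, again as a toy model. The function names, the 
dict of new values, and the rule that the reader ignores dv0 once dv1 exists 
are all assumptions for illustration; the actual sequence number semantics 
are exactly the nuance mentioned above.

```python
def write_column_update(f0_B, new_values, eq_delete_values):
    """Write-time step: produce the column file f1 and position deletes dv1.

    `new_values` maps row position -> new value for B (live rows only).
    """
    # The writer already has to resolve deleted rows to place NULL fillers,
    # so it can record those positions as a position delete (dv1) as well.
    dv1 = {pos for pos, old_b in enumerate(f0_B) if old_b in eq_delete_values}
    f1_B = [None if pos in dv1 else new_values[pos] for pos in range(len(f0_B))]
    return f1_B, dv1


def read_option2(f0_A, f1_B, dv1):
    """Read-time step: no access to old B values and no evaluation of dv0."""
    return [
        (a, new_b)
        for pos, (a, new_b) in enumerate(zip(f0_A, f1_B))
        if pos not in dv1
    ]


f0_A = ["a1", "a2", "a3"]
f0_B = ["b1", "b2", "b3"]    # old values (seq0)

# Written at seq2; dv0 (B = b2, seq1) is resolved here, once.
f1_B, dv1 = write_column_update(f0_B, {0: "b1_new", 2: "b2"}, {"b2"})
rows = read_option2(f0_A, f1_B, dv1)

print(f1_B, dv1)
print(rows)
```

The read path yields the same rows as Option 1 but only touches f1 and dv1.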

Let me know your thoughts. This is a complex problem, and it's important to 
get it right.

On 2026/05/20 00:37:07 Anurag Mantripragada wrote:
> Hi all,
> 
> Following up on the column updates design
> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65>
> and
> the original discussion thread
> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd
> like to start a focused discussion on how column update files should
> represent rows when deletion vectors (DVs) are present.
> 
> *Context*
> 
> We've reached consensus on using a dense representation for column update
> files. When a column is updated, the column file contains values for all
> rows including unchanged rows. This avoids complex merge logic on the write
> path when successive updates target overlapping fields.
> 
> The open question is: what should the column file contain at positions
> where the base file has deleted rows? There are two options.
> 
> *Option 1*: Positional Alignment (row count matches base file)
> 
> The column file has exactly base_file.record_count rows. Row N in the
> column file corresponds to row N in the base file. Deleted positions
> contain filler values (e.g., NULLs).
> 
> *Pros:*
> 
>    - Stitching is a zero-copy column swap in Arrow
>    - Works identically in every Arrow implementation (Java, Rust, Python,
>    C++)
>    - No _pos column needed
>    - Engines apply their existing DV filter to both base and column file
> 
> *Cons:*
> 
>    - Filler values at deleted positions skew Parquet footer statistics
>    (null_count, avg_length)
>    - Writes slightly more data than necessary (filler values for deleted
>    rows)
>    - Writer must know base_file.record_count to pad trailing deletions
>    (base file metadata already available during write planning)
> 
> *Option 2*: Applied Deletes (row count = live rows only)
> 
> The column file contains only live rows (after applying DVs). A _pos column
> maps each row back to its ordinal position in the base file.
> 
> *Pros:*
> 
>    - Only stores valid rows in column update files.
>    - Parquet footer statistics are accurate (no skew from NULLs at deleted
>    positions)
>    - Slightly smaller file (no filler bytes)
> 
> *Cons:*
> 
>    - _pos adds storage overhead (Encoding must be left to the file format)
>    - Stitching requires a scatter operation to allocate a new array and
>    place values at the correct positions
>    - It's not zero-copy in Arrow and requires manipulation.
>    - As it stands today this might be harder for non-Java engines (see
>    section below)
> 
> I investigated how three Iceberg implementations handle vectorized reading
> and what column stitching would require in each. The key architectural
> difference is how they expose Arrow memory:
> 
> *Java/Spark:* Spark's ColumnVector is an abstract class. We can override
> getInt(rowId) to redirect reads without copying data. This makes scatter
> operations appear "free" via virtual dispatch. My POC uses this approach.
> 
> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way
> to override what array[i] returns. PyArrow has take() (gather) but lacks a
> scatter() primitive (in the version we use).
> 
> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not trait
> objects). Int32Array::value(i) is a direct memory offset. Must materialize
> new arrays via ArrayBuilder for any non-trivial column manipulation.
> 
> TL;DR: If we choose Option 2 (applied deletes), engines need a scatter
> method to stitch column files. I found the following implementations in
> Arrow which can be used to stitch.
> 
> 
>    - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow 20.0.0)
> 
>    - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow
>    23.0.0)
>    - I did not find scatter in arrow-rs.
> 
> I'm still researching these options and would love to hear from everyone.
> 
> Thanks,
> Anurag
>
