Re: [Discuss] Column Update File Representation

Gábor Kaszab Thu, 28 May 2026 05:50:07 -0700

Hi Xiening,

Thank you for thinking this through and sharing the options you found! Let
me share my take on them below:


Option 1:
One advantage of the dense representation of column updates is that when
you update column_x, you no longer have to read column_x from the base
file. With the design you propose, you do, as you describe. Also, would you
allow writing eq-deletes after we have column updates? If so, you have
to keep track of the sequence of column updates and apply the eq-deletes
at each stage.
Steps:
  1) Create base file, seq_num=0
  2) eq-delete some rows on col_A, seq_num=1
  3) column update col_A, seq_num=2
  4) eq-delete some other rows on col_A, seq_num=3
  5) column update col_A, seq_num=4
Now, with your proposal, if I'm not mistaken, we have to read all the files
for col_A. Also, in step 5 we can't just drop the old column update file,
because it is still needed for the eq-deletes.

I think this is too much complexity, and too inefficient to keep reading
old columns that we have already updated.
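
To make the cost concrete, here is a minimal Python sketch of the read path
for the five steps above. It assumes each eq-delete has to be evaluated
against the newest col_A version written before it. The function name, the
in-memory lists and the sample values are hypothetical and only illustrate
the bookkeeping; none of this is an Iceberg API.

```python
from bisect import bisect_left


def resolve_eq_deletes(col_versions, eq_deletes):
    """Resolve eq-deletes against staged, positionally aligned column versions.

    col_versions: {seq_num: [value at each base-file position]}
    eq_deletes:   [(seq_num, deleted_value)]
    Returns (deleted positions, seq_nums of the column versions that were read).
    """
    seqs = sorted(col_versions)
    deleted, versions_read = set(), set()
    for del_seq, value in eq_deletes:
        # Assumption: a delete applies to the newest version with a
        # strictly smaller sequence number than the delete itself.
        idx = bisect_left(seqs, del_seq) - 1
        if idx < 0:
            continue  # nothing was written before this delete
        src = seqs[idx]
        versions_read.add(src)
        for pos, v in enumerate(col_versions[src]):
            if v == value:
                deleted.add(pos)
    return deleted, versions_read


# Hypothetical data mirroring steps 1-5 (col_A only).
col_a = {
    0: ["a", "b", "c", "d"],      # step 1: base file
    2: ["x", None, "b", "y"],     # step 3: column update, NULL at deleted pos 1
    4: ["p", None, "q", None],    # step 5: column update
}
eq_deletes = [(1, "b"), (3, "y")]  # steps 2 and 4

deleted, versions_read = resolve_eq_deletes(col_a, eq_deletes)
```

In this toy run the reader needs the seq_num=0 and seq_num=2 versions just
to resolve the deletes, plus the seq_num=4 version for the actual values.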

Option 2:
"since we already need to figure out the deleted rows" => in the writer
there isn't much to figure out for filling with nulls. If the row didn't
come in (based on the _pos column) we fill it with null. We don't know if
the row is missing because of DVs or eq-deletes.
What you propose here is basically rewriting eq-deletes into DVs
seamlessly. I think this is something we should leave to the users to do
themselves; it might be confusing to do it under the hood.
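
As a rough illustration of the null filling I describe above, here is a
small Python sketch. It assumes the writer receives (_pos, value) pairs for
the rows that reached it and knows the base file's record count; the
function name and the sample rows are hypothetical.

```python
def pad_to_base_positions(live_rows, record_count):
    """Build a dense, positionally aligned column from (_pos, value) pairs.

    Positions for which no row arrived stay None. The writer does not know
    (and does not need to know) whether a DV or an eq-delete removed them.
    """
    dense = [None] * record_count
    for pos, value in live_rows:
        if not 0 <= pos < record_count:
            raise ValueError(f"_pos {pos} outside base file of {record_count} rows")
        dense[pos] = value
    return dense


# Hypothetical input: rows 1, 4 and 5 of a 6-row base file never arrive.
live = [(0, "x"), (2, "z"), (3, "w")]
column_file = pad_to_base_positions(live, record_count=6)
```

Note that the trailing positions 4 and 5 are padded as well, which is why
the writer needs the record count of the base file.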

Let me know what you think!
Gabor

Xiening Dai <[email protected]> wrote (on Thu, 28 May 2026, 8:48):

> Hi Anurag,
>
> Thanks for working on this.
>
> Following up on the discussion around Option 1 (positional alignment), I
> want to see whether we can make equality deletes interact properly with
> column update files. This was identified as a problematic area, and the
> original design doc chose not to support equality deletes. I think that
> would be a major limitation. Also, just identifying whether a particular
> column participated in an equality delete can already be expensive. Same
> for the opposite.
>
> Here I propose two options to handle this; let me know if they make sense.
>
> Let's look at the problem first. Consider this sequence:
>
>   1. Base file f0 contains columns A, B (sequence number seq0)
>   2. Equality delete dv0: DELETE WHERE B = b2 (sequence number seq1)
>   3. Full column update writes f1 for column B with positional alignment —
> NULLs at deleted positions (sequence number seq2)
>
> At read time, we have a problem:
>
>   - We cannot evaluate dv0's predicate against f1, because f1 has new
> values for B — the predicate would either miss the NULL placeholders or
> falsely match new values that happen to equal the delete predicate but were
> written after the delete.
>   - We cannot skip dv0, as we still need to resolve and filter out deleted
> rows
>
> In short: the equality delete's predicate is meaningless against the new
> column file, but its positional effect must still be enforced.
>
> To solve this, there are two options:
>
> Option 1:
>
> We evaluate the equality delete against the old values of column B from
> f0, then apply the resulting filter bitmap when reading column file f1.
> This gives us the correct result.
>
> Pros - Simple mental model. The equality delete always applies to the
> original value set. No big change on write.
> Cons - We need to scan column B twice: once for the old values, and once
> for the new values.
>
> Option 2:
>
> At column file generation time, since we already need to figure out the
> deleted rows (to place NULL fillers), we can materialize the delete
> positions at write time by creating a new position delete file (dv1) with
> the new sequence number (seq2). This makes the original equality delete
> obsolete (seq2 > seq1).
>
> Pros - No need to load the old column at read time. We just read f1 for B,
> then apply the positional deletes from dv1, ignoring dv0.
> Cons - There is nuance regarding how we handle the delete sequence number.
> And we will have to do the same even for position deletes in order to be
> consistent.
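
A minimal Python sketch of the write-time materialization described in
Option 2 above, using the f0/dv0/f1/dv1 names from the example. It assumes
the writer can see the old values of B and the values targeted by the
equality delete; the function name, its parameters and the sample data are
hypothetical.

```python
def write_column_update(old_values, new_values_by_pos, eq_delete_values):
    """Return (f1, dv1): the dense column file and materialized delete positions.

    old_values:        column B as stored in f0, by position
    new_values_by_pos: {position: new value} for the rows being updated
    eq_delete_values:  values targeted by the earlier equality delete (dv0)
    """
    # Evaluate the equality delete once, against the OLD values only.
    dv1 = {pos for pos, v in enumerate(old_values) if v in eq_delete_values}
    # Dense output: NULL filler at deleted positions, otherwise new or old value.
    f1 = [
        None if pos in dv1 else new_values_by_pos.get(pos, old)
        for pos, old in enumerate(old_values)
    ]
    return f1, dv1


# Hypothetical data: DELETE WHERE B = b2, then an update that writes a NEW b2.
f0_b = ["b1", "b2", "b3", "b2"]
f1, dv1 = write_column_update(f0_b, {0: "b2"}, {"b2"})

# The reader applies only the positional deletes from dv1 and ignores dv0.
visible = [v for pos, v in enumerate(f1) if pos not in dv1]
```

The new b2 written at position 0 stays visible because the predicate is
never evaluated against f1.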
>
> Let me know your thoughts. This is a complex problem. And it's important
> to get it right.
>
> On 2026/05/20 00:37:07 Anurag Mantripragada wrote:
> > Hi all,
> >
> > Following up on the column updates design
> > <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65>
> > and the original discussion thread
> > <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd
> > like to start a focused discussion on how column update files should
> > represent rows when deletion vectors (DVs) are present.
> >
> > *Context*
> >
> > We've reached consensus on using a dense representation for column update
> > files. When a column is updated, the column file contains values for all
> > rows, including unchanged rows. This avoids complex merge logic on the
> > write path when successive updates target overlapping fields.
> >
> > The open question is: what should the column file contain at positions
> > where the base file has deleted rows? There are two options.
> >
> > *Option 1*: Positional Alignment (row count matches base file)
> >
> > The column file has exactly base_file.record_count rows. Row N in the
> > column file corresponds to row N in the base file. Deleted positions
> > contain filler values (e.g., NULLs).
> >
> > Pros:
> >
> >    - Stitching is a zero-copy column swap in Arrow
> >    - Works identically in every Arrow implementation (Java, Rust, Python,
> >    C++)
> >    - No _pos column needed
> >    - Engines apply their existing DV filter to both base and column file
> >
> > Cons:
> >
> >    - Filler values at deleted positions skew Parquet footer statistics
> >    (null_count, avg_length)
> >    - Writes slightly more data than necessary (filler values for deleted
> >    rows)
> >    - Writer must know base_file.record_count to pad trailing deletions
> >    (base file metadata already available during write planning)
> >
> > *Option 2*: Applied Deletes (row count = live rows only)
> >
> > The column file contains only live rows (after applying DVs). A _pos
> > column maps each row back to its ordinal position in the base file.
> >
> > Pros:
> >
> >    - Only stores valid rows in column update files.
> >    - Parquet footer statistics are accurate (no skew from NULLs at
> >    deleted positions)
> >    - Slightly smaller file (no filler bytes)
> >
> > Cons:
> >
> >    - _pos adds storage overhead (encoding must be left to the file
> >    format)
> >    - Stitching requires a scatter operation to allocate a new array and
> >    place values at the correct positions
> >    - It's not zero-copy in Arrow and requires manipulation.
> >    - As it stands today, this might be harder for non-Java engines (see
> >    section below)
> >
> > I investigated how three Iceberg implementations handle vectorized
> > reading and what column stitching would require in each. The key
> > architectural difference is how they expose Arrow memory:
> >
> > *Java/Spark:* Spark's ColumnVector is an abstract class. We can
> > override getInt(rowId) to redirect reads without copying data. This makes
> > scatter operations appear "free" via virtual dispatch. My POC uses this
> > approach.
> >
> > *PyIceberg:* Uses PyArrow's native arrays. I could not find a way to
> > override what array[i] returns. PyArrow has take() (gather) but lacks a
> > scatter() primitive (in the version we use).
> >
> > *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not
> > trait objects). Int32Array::value(i) is a direct memory offset. We must
> > materialize new arrays via ArrayBuilder for any non-trivial column
> > manipulation.
> >
> > TL;DR: If we choose Option 2 (applied deletes), engines need a scatter
> > method to stitch column files. I found the following implementations in
> > Arrow which can be used to stitch.
> >
> >
> >    - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow
> >    20.0.0)
> >    - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow
> >    23.0.0)
> >    - I did not find scatter in arrow-rs.
> >
> > I'm still researching these options and would love to hear from everyone.
> >
> > Thanks,
> > Anurag
> >
>
