Re: [Discuss] Column Update File Representation

Andrei Tserakhau via dev Wed, 20 May 2026 08:28:44 -0700

Hello All,

Coming in from the Delta side -- we ran into this exact design
choice about a year ago and converged on Option 1. +1 from the
prior-art angle, with measurements and a cross-engine note.


== What we measured on Delta ==

We did a write-path PoC of dense, row-aligned column files
(never landed in production, but the design tradeoffs got
exercised end-to-end). We compared positional alignment vs
an applied-deletes / scatter style and chose positional
alignment for two reasons that match what's surfacing here:

1. Stitching simplicity dominated the read code. The scatter
   path complicated vectorized batch alignment everywhere,
   not just in the stitching reader -- predicate pushdown,
   late materialization, and DV filter application all had to
   special-case it. Option 1's zero-copy column swap composed
   cleanly with the rest of the pipeline.

2. Trailing-deletion handling for positional alignment was
   solvable. We did the same thing Gabor described: broadcast
   a "file path -> row count" mapping to writers alongside
   the scan, so they could pad the tail with NULLs accurately.
   That worked; we never hit a case where it didn't.

On stats: NULL-filled deleted positions did skew Parquet footer
null_count, but it was an accounting fix, not a correctness
issue. The adjustment matches Gabor's proposal -- subtract the
delete count from the column null count. We considered it
tracked-and-acceptable, not a blocker.

== One extra cross-engine point ==

Adding to the arrow-rs / iceberg-rust portability point:
delta-kernel-rs (the Rust read path for Delta) is in the same
position. Same arrow-rs constraints, same lack of a scatter
primitive. If we picked Option 2, both Rust ecosystems would
face identical implementation pressure and would likely block
on the same upstream arrow-rs work. Option 1 lets each Rust
implementation follow arrow-rs idioms naturally, which seems
healthier than coordinating a cross-ecosystem dependency on a
new upstream primitive.

It's also worth a look at `iceberg-go` before locking the
decision -- Go's arrow ecosystem is even less mature than
arrow-rs for this kind of column manipulation, and Option 2
might land hardest there. arrow-go's compute surface is
materially smaller than arrow-rs's (no zip, no scatter, no
case-when at all), so on iceberg-go Option 2 isn't "harder by
some constant": it requires building a kernel that doesn't
exist in the ecosystem, before any iceberg-go integration code
is written. Option 1 reuses primitives that already ship
today, so something can be done in weeks.

== On _pos in positional-aligned files ==

Doubling Gabor's point: include _pos in positional-aligned files
regardless. <0.5% overhead in Parquet V2 is a cheap price for
debug visibility, and it keeps the door open if a future engine
wants to optimize specific read paths with scatter-style
stitching. Treat _pos as advisory metadata, not as the
authoritative row mapping.

Thanks for the clean write-up.

Andrei

On Wed, May 20, 2026 at 4:52 PM Gábor Kaszab <[email protected]> wrote:

> Hey Iceberg Community,
>
> Thanks Anurag for starting a dedicated thread on this! Just a couple of
> thoughts from my side:
>
> *Storage overhead:*
> *TLDR:* Regardless which way we go, storage overhead shouldn't be an
> issue.
>
> *Details:*
> I made a couple of measurements recently on the storage size of the update
> files. For numbers, see the "Update file size measurements" tab in this
> doc
> <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing>
> .
>
> 1) Writing _pos column:
>
>    - File formats could hide the overhead for the _pos column, e.g.
>    Parquet V2 has negligible (<0,5%) overhead
>    - I recall there was an argument that having _pos even for pos-aligned
>    updates might help us debug writer issues to see what is missing/icorrect
>    in the update file
>    - On the write path we already read and sort by _pos so there should
>    be no overhead either
>
> In summary, I slightly lean towards always having _pos in the updates
> regardless what representation we choose.
> If we can agree on this, the relevant pros and cons around _pos are
> obsolete.
>
> 2) Filling rows/values for deleted rows:
>
>    - I measured noticeable overhead when filling fields for deleted rows
>    with NULLs. As always the overhead depends on many stuff
>    - Deleted 5% of rows => writing NULLs has 2-3% storage overhead
>       - Deleted 20% of rows => writing NULLs has 4-4.6% overhead
>    - Since the consensus is constantly eliminating deletes, I think this
>    overhead is acceptable
>
> *Factors to consider to choose representation*
> As described above, the storage size shouldn't be a factor here, I'd not
> consider the presence of _pos as pro or con.
>
> If we scratch the storage size related ones, here is what we have left:
>
> *1) Accuracy of stats*
> By filling missing rows with auxiliary values the stats can go off
>
>    - I'm not that worried about Parquet footer stats, they represent the
>    file itself, not the logical deletes on top
>    - If we went for NULLs as filling values, we could set each field in a
>    deleted row to NULL (not the entire row). As a result the column-level null
>    count in the table metadata can be off because of the filling values.
>    - Technically, if we want to, we can correct these stats if we
>    collected the number of deleted rows and then adjust the null count by this
>    number
>    - In general, deletes make stats off anyway regardless of column
>    updates (probably not the null count but the avg size, though)
>
> I don't think that this should be a decisive factor when choosing
> representation.
>
> *2) Complexity in general*
>     a) Read - stitching
>
>    - Positional alignment approach is straightforward for stitching both
>    in vectorized and row readers
>       - Stitch before applying deletes
>    - Applied deletes approach seems more complicated initially
>       - Scattering of update rows is required
>       - Might not be supported for all the language implementations ATM.
>       Thanks Anurag for taking a look!
>
>     b) Write
>
>    - Applied deletes approach is straightforward
>    - Positional alignment approach has one major complexity:
>       - when trailing rows are deleted, the writer currently has no
>       information how many rows to fill
>       - In the PoC we broadcast a "file path to row count" so that the
>       writers can now if trailing rows have to be filled with nulls, and with 
> how
>       many (comparing to the _pos column)
>       - In theory we could simply not fill trailing deleted rows, but
>       then we have a hybrid approach between positional alignment and applied
>       deletes. Probably, we don't want this complexity in the spec
>
> *3) Read and write performance*
>
>    - I'm not expecting any difference in write perf
>    - Read could have a toll on the "applied deletes" approach due to
>    scattering. *@pvary* might have some more insights here.
>
> *Summary*
> I hope this summarizes all we have discussed from a different angle and
> might narrow down the areas to look for. I think the two main questions to
> sort out to make a decision are:
>     1) Can we find a way to take care of trailing deletes in "positional
> aligned" approach (or are we fine not filling trailing deletes)
>     2) What is the cost of scattering the update rows in the "applied
> deletes" approach
>         2/b) Is scattering feasible on all language implementations
>
> Best Regards,
> Gabor
>
>
> Anurag Mantripragada <[email protected]> ezt írta (időpont:
> 2026. máj. 20., Sze, 2:37):
>
>> Hi all,
>>
>> Following up on the column updates design
>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65>
>>  and
>> the original discussion thread
>> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd
>> like to start a focused discussion on how column update files should
>> represent rows when deletion vectors (DVs) are present.
>>
>> *Context*
>>
>> We've reached consensus on using a dense representation for column
>> update files. When a column is updated, the column file contains values for
>> all rows including unchanged rows. This avoids complex merge logic on the
>> write path when successive updates target overlapping fields.
>>
>> The open question is: what should the column file contain at positions
>> where the base file has deleted rows? There are two options.
>>
>> *Option 1*: Positional Alignment (row count matches base file)
>>
>> The column file has exactly base_file.record_count rows. Row N in the
>> column file corresponds to row N in the base file. Deleted positions
>> contain filler values (e.g., NULLs).
>>
>> Pros*:*
>>
>>    - Stitching is a zero-copy column swap in Arrow
>>    - Works identically in every Arrow implementation (Java, Rust,
>>    Python, C++)
>>    - No _pos column needed
>>    - Engines apply their existing DV filter to both base and column file
>>
>> Cons*:*
>>
>>    - Filler values at deleted positions skew Parquet footer statistics
>>    (null_count, avg_length)
>>    - Writes slightly more data than necessary (filler values for deleted
>>    rows)
>>    - Writer must know base_file.record_count to pad trailing deletions
>>    (base file metadata already available during write planning)
>>
>> *Option 2*: Applied Deletes (row count = live rows only)
>>
>> The column file contains only live rows (after applying DVs). A _pos column
>> maps each row back to its ordinal position in the base file.
>>
>> Pros*:*
>>
>>    - Only stores valid rows in column update files.
>>    - Parquet footer statistics are accurate (no skew from NULLs at
>>    deleted positions)
>>    - Slightly smaller file (no filler bytes)
>>
>> Cons*:*
>>
>>    - _pos adds storage overhead (Encoding must be left to the file
>>    format)
>>    - Stitching requires a scatter operation to allocate a new array and
>>    place values at the correct positions
>>    - It's not zero-copy in Arrow and requires manipulation.
>>    - As it stands today this might be  harder for non-Java engines (see
>>    section below)
>>
>> I investigated how three Iceberg implementations handle vectorized
>> reading and what column stitching would require in each. The key
>> architectural difference is how they expose Arrow memory:
>>
>> * Java/Spark**:* Spark's ColumnVector is an abstract class. We can
>> override getInt(rowId)to redirect reads without copying data. This makes
>> scatter operations appear "free" via virtual dispatch. My POC uses this
>> approach.
>>
>> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way
>> to override what array[i] returns. PyArrow has take() (gather) but lacks
>> a scatter() primitive (in the  version we use).
>>
>> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not
>> trait objects). Int32Array::value(i) is a direct memory offset. Must
>> materialize new arrays via ArrayBuilder for any non-trivial column
>> manipulation.
>>
>> TL;DR: If we choose Option 2 (applied deletes), engines need a scatter
>> method to stitch column files. I found the following implementations in
>> Arrow which can be used to stitch.
>>
>>
>>    - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow
>>    20.0.0)
>>
>>    - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow
>>    23.0.0)
>>    - I did not find scatter in arrow-rs.
>>
>> I'm still researching these options and would love to hear from everyone.
>>
>> Thanks,
>> Anurag
>>
>>

Re: [Discuss] Column Update File Representation

Reply via email to