Re: [Discuss] Column Update File Representation

Gábor Kaszab Wed, 20 May 2026 08:36:43 -0700

Minor addition to the storage overhead with writing NULLs as filling values:
If we assume Parquet V2 and we write the _pos to the update file, with 20%
of the rows deleted will totally amortize the storage overhead of writing
NULLs. The reason is that with the "positional aligned" updates we have a
complete sequence of the _pos that is efficiently compressed with delta
encoding, while with the "applied deletes" approach there are holes in the
_pos sequence, hence delta encoding is slightly less efficient.


Just some number to visualize with 20% deletion ratio:
1 int col _pos + 1 int
Omit deleted rows 9735745 10111232
Null deleted rows 10189522 10205497
Overhead 4,66% 0,93%

Gabor


Gábor Kaszab <[email protected]> ezt írta (időpont: 2026. máj. 20.,
Sze, 16:50):

> Hey Iceberg Community,
>
> Thanks Anurag for starting a dedicated thread on this! Just a couple of
> thoughts from my side:
>
> *Storage overhead:*
> *TLDR:* Regardless which way we go, storage overhead shouldn't be an
> issue.
>
> *Details:*
> I made a couple of measurements recently on the storage size of the update
> files. For numbers, see the "Update file size measurements" tab in this
> doc
> <https://docs.google.com/spreadsheets/d/1I5u72D-4LbIs7p7lBnf5ITZou4V9jdHAj9z9p_nUO0Q/edit?usp=sharing>
> .
>
> 1) Writing _pos column:
>
>    - File formats could hide the overhead for the _pos column, e.g.
>    Parquet V2 has negligible (<0,5%) overhead
>    - I recall there was an argument that having _pos even for pos-aligned
>    updates might help us debug writer issues to see what is missing/icorrect
>    in the update file
>    - On the write path we already read and sort by _pos so there should
>    be no overhead either
>
> In summary, I slightly lean towards always having _pos in the updates
> regardless what representation we choose.
> If we can agree on this, the relevant pros and cons around _pos are
> obsolete.
>
> 2) Filling rows/values for deleted rows:
>
>    - I measured noticeable overhead when filling fields for deleted rows
>    with NULLs. As always the overhead depends on many stuff
>    - Deleted 5% of rows => writing NULLs has 2-3% storage overhead
>       - Deleted 20% of rows => writing NULLs has 4-4.6% overhead
>    - Since the consensus is constantly eliminating deletes, I think this
>    overhead is acceptable
>
> *Factors to consider to choose representation*
> As described above, the storage size shouldn't be a factor here, I'd not
> consider the presence of _pos as pro or con.
>
> If we scratch the storage size related ones, here is what we have left:
>
> *1) Accuracy of stats*
> By filling missing rows with auxiliary values the stats can go off
>
>    - I'm not that worried about Parquet footer stats, they represent the
>    file itself, not the logical deletes on top
>    - If we went for NULLs as filling values, we could set each field in a
>    deleted row to NULL (not the entire row). As a result the column-level null
>    count in the table metadata can be off because of the filling values.
>    - Technically, if we want to, we can correct these stats if we
>    collected the number of deleted rows and then adjust the null count by this
>    number
>    - In general, deletes make stats off anyway regardless of column
>    updates (probably not the null count but the avg size, though)
>
> I don't think that this should be a decisive factor when choosing
> representation.
>
> *2) Complexity in general*
>     a) Read - stitching
>
>    - Positional alignment approach is straightforward for stitching both
>    in vectorized and row readers
>       - Stitch before applying deletes
>    - Applied deletes approach seems more complicated initially
>       - Scattering of update rows is required
>       - Might not be supported for all the language implementations ATM.
>       Thanks Anurag for taking a look!
>
>     b) Write
>
>    - Applied deletes approach is straightforward
>    - Positional alignment approach has one major complexity:
>       - when trailing rows are deleted, the writer currently has no
>       information how many rows to fill
>       - In the PoC we broadcast a "file path to row count" so that the
>       writers can now if trailing rows have to be filled with nulls, and with 
> how
>       many (comparing to the _pos column)
>       - In theory we could simply not fill trailing deleted rows, but
>       then we have a hybrid approach between positional alignment and applied
>       deletes. Probably, we don't want this complexity in the spec
>
> *3) Read and write performance*
>
>    - I'm not expecting any difference in write perf
>    - Read could have a toll on the "applied deletes" approach due to
>    scattering. *@pvary* might have some more insights here.
>
> *Summary*
> I hope this summarizes all we have discussed from a different angle and
> might narrow down the areas to look for. I think the two main questions to
> sort out to make a decision are:
>     1) Can we find a way to take care of trailing deletes in "positional
> aligned" approach (or are we fine not filling trailing deletes)
>     2) What is the cost of scattering the update rows in the "applied
> deletes" approach
>         2/b) Is scattering feasible on all language implementations
>
> Best Regards,
> Gabor
>
>
> Anurag Mantripragada <[email protected]> ezt írta (időpont:
> 2026. máj. 20., Sze, 2:37):
>
>> Hi all,
>>
>> Following up on the column updates design
>> <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?tab=t.0#heading=h.b3mc4alqde65>
>>  and
>> the original discussion thread
>> <https://lists.apache.org/thread/w90rqyhmh6pb0yxp0bqzgzk1y1rotyny>, I'd
>> like to start a focused discussion on how column update files should
>> represent rows when deletion vectors (DVs) are present.
>>
>> *Context*
>>
>> We've reached consensus on using a dense representation for column
>> update files. When a column is updated, the column file contains values for
>> all rows including unchanged rows. This avoids complex merge logic on the
>> write path when successive updates target overlapping fields.
>>
>> The open question is: what should the column file contain at positions
>> where the base file has deleted rows? There are two options.
>>
>> *Option 1*: Positional Alignment (row count matches base file)
>>
>> The column file has exactly base_file.record_count rows. Row N in the
>> column file corresponds to row N in the base file. Deleted positions
>> contain filler values (e.g., NULLs).
>>
>> Pros*:*
>>
>>    - Stitching is a zero-copy column swap in Arrow
>>    - Works identically in every Arrow implementation (Java, Rust,
>>    Python, C++)
>>    - No _pos column needed
>>    - Engines apply their existing DV filter to both base and column file
>>
>> Cons*:*
>>
>>    - Filler values at deleted positions skew Parquet footer statistics
>>    (null_count, avg_length)
>>    - Writes slightly more data than necessary (filler values for deleted
>>    rows)
>>    - Writer must know base_file.record_count to pad trailing deletions
>>    (base file metadata already available during write planning)
>>
>> *Option 2*: Applied Deletes (row count = live rows only)
>>
>> The column file contains only live rows (after applying DVs). A _pos column
>> maps each row back to its ordinal position in the base file.
>>
>> Pros*:*
>>
>>    - Only stores valid rows in column update files.
>>    - Parquet footer statistics are accurate (no skew from NULLs at
>>    deleted positions)
>>    - Slightly smaller file (no filler bytes)
>>
>> Cons*:*
>>
>>    - _pos adds storage overhead (Encoding must be left to the file
>>    format)
>>    - Stitching requires a scatter operation to allocate a new array and
>>    place values at the correct positions
>>    - It's not zero-copy in Arrow and requires manipulation.
>>    - As it stands today this might be  harder for non-Java engines (see
>>    section below)
>>
>> I investigated how three Iceberg implementations handle vectorized
>> reading and what column stitching would require in each. The key
>> architectural difference is how they expose Arrow memory:
>>
>> * Java/Spark**:* Spark's ColumnVector is an abstract class. We can
>> override getInt(rowId)to redirect reads without copying data. This makes
>> scatter operations appear "free" via virtual dispatch. My POC uses this
>> approach.
>>
>> *PyIceberg:* Uses PyArrow's native arrays. I could not find a way
>> to override what array[i] returns. PyArrow has take() (gather) but lacks
>> a scatter() primitive (in the  version we use).
>>
>> *iceberg-rust:* Uses arrow-rs arrays, which are concrete structs (not
>> trait objects). Int32Array::value(i) is a direct memory offset. Must
>> materialize new arrays via ArrayBuilder for any non-trivial column
>> manipulation.
>>
>> TL;DR: If we choose Option 2 (applied deletes), engines need a scatter
>> method to stitch column files. I found the following implementations in
>> Arrow which can be used to stitch.
>>
>>
>>    - C++ <https://github.com/apache/arrow/pull/44394> (Since Arrow
>>    20.0.0)
>>
>>    - Python <https://github.com/apache/arrow/pull/48267> (Since Arrow
>>    23.0.0)
>>    - I did not find scatter in arrow-rs.
>>
>> I'm still researching these options and would love to hear from everyone.
>>
>> Thanks,
>> Anurag
>>
>>

Re: [Discuss] Column Update File Representation

Reply via email to