Re: [Discuss] Column Update File Representation

Steven Wu Thu, 25 Jun 2026 15:09:01 -0700

> because column files are short-lived. Compaction rewrites them back into
> the base files regularly, so there is no long-lived dense corpus to migrate.


I don't necessarily agree that column files are short-lived. There is a
discussion about column families that has been punted for later. Separate
column files for column families can be a desired and optimal long-term
state.

I would also favor starting with the dense representation (filler values
for deleted rows).
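
To make the dense vs. sparse trade-off concrete, here is a minimal sketch in
Python. It is an illustration only, not the PoC or any Iceberg API: the names
write_dense, read_dense, read_sparse and FILLER are hypothetical, and plain
lists stand in for Parquet columns and DVs. It assumes the writer knows the
base file record count, so trailing deleted positions can be padded.

```python
from typing import List, Optional, Sequence, Tuple

FILLER = None  # hypothetical filler value written at deleted positions


def write_dense(live: Sequence[Tuple[int, str]],
                record_count: int) -> Tuple[List[Optional[str]], List[int]]:
    """Build a dense column file: exactly one value per base-file position.

    `live` holds (_pos, new_value) pairs for rows that are still active.
    Positions absent from `live`, including trailing ones, get FILLER and
    are reported as deleted positions (the DV).
    """
    values: List[Optional[str]] = [FILLER] * record_count
    seen = set()
    for pos, value in live:
        values[pos] = value
        seen.add(pos)
    dv = [pos for pos in range(record_count) if pos not in seen]
    return values, dv


def read_dense(base_column: Sequence[str],
               column_file: Sequence[Optional[str]],
               dv: Sequence[int]) -> List[Optional[str]]:
    """Dense read: positional substitution, then drop deleted positions."""
    assert len(column_file) == len(base_column)  # full coverage invariant
    deleted = set(dv)
    return [v for pos, v in enumerate(column_file) if pos not in deleted]


def read_sparse(base_column: Sequence[str],
                positions: Sequence[int],
                values: Sequence[str],
                dv: Sequence[int]) -> List[Optional[str]]:
    """Sparse read: scatter updates over the base column by _pos, then
    drop deleted positions. This is the extra merge step on the read path."""
    merged: List[Optional[str]] = list(base_column)
    for pos, value in zip(positions, values):
        merged[pos] = value
    deleted = set(dv)
    return [v for pos, v in enumerate(merged) if pos not in deleted]


base = ["a", "b", "c", "d", "e"]        # base file with 5 rows
live = [(0, "A"), (2, "C"), (3, "D")]   # rows 1 and 4 are deleted
dense_values, dv = write_dense(live, record_count=len(base))
dense_result = read_dense(base, dense_values, dv)
sparse_result = read_sparse(base, [0, 2, 3], ["A", "C", "D"], dv)
```

Both readers return the same rows here. The difference is that the dense
reader never touches the old base column values, while the sparse reader has
to load and merge them.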


On Thu, Jun 25, 2026 at 2:58 PM Andrei Tserakhau via dev <
[email protected]> wrote:

> +1 on picking dense as the single representation now, rather than leaving
> it up to the engine.
>
> The reason I'd mandate it, not just allow it, is the asymmetry. Dense is
> the special case of sparse, so mandating the special case is the smallest
> thing every reader has to implement: a positional substitution, no scatter,
> no merge-on-read stacks. And it covers the dominant workload directly -
> refreshing a whole column (a new column from an expression, or overwriting
> an existing column with new values like embeddings or model weights) is
> full-coverage by nature, not a point update to a few rows.
>
> Key thing here is that going dense now does not close the door on sparse.
>
> A sparse-capable reader is a superset of a dense one - it can read dense
> files too, since full coverage is just sparse with every position present.
> So adding sparse later is an additive format version: it widens the reader,
> it does not break existing dense files, and tables that never need sparse
> never pay for it. The reverse is not true. If we allow sparse now, every
> engine and client has to implement the harder merge-on-read path from day
> one, for row-level partial updates the current workloads are not asking for.
>
> On whether dense is a one-way door: I don't think it is, because column
> files are short-lived. Compaction rewrites them back into the base files
> regularly, so there is no long-lived dense corpus to migrate. If we add
> sparse later, old dense files age out through normal compaction, or we
> rewrite them - and because they are transient, that cost is bounded and
> amortized, not a table-wide migration.
>
> So my preference is to specify dense now and keep sparse as a documented
> future extension with its own format version, rather than leaving the
> representation unspecified. Leaving it open is the worst of the three: as
> Peter pointed out, it forces every reader to support sparse anyway, which
> is the exact cost we would be trying to defer.
>
> Best,
> Andrei
>
> On Thu, Jun 25, 2026 at 7:58 PM Steven Wu <[email protected]> wrote:
>
>> > We can, and in the PoC this is what we do, broadcast the "location ->
>> > record count" mapping to the writers for this.
>>
>> I am wondering if column file generation usually needs to scan the
>> existing base files (or column files) anyway. Otherwise, a default value
>> column (with expressions) should probably be sufficient. So the writer
>> probably already has the data file metadata.
>>
>> Plus, carrying over additional contextual information (like manifest file
>> location and entry position) is very beneficial, as the snapshot producer
>> can generate manifest DVs efficiently without scanning manifest files to
>> locate the old TrackedFile entry to delete (maybe via manifest DV).
>>
>> > Alternatively, when scanning inputs for the writers, we can also query
>> > the '_deleted' metadata column.
>>
>> I agree; this is another nice way to solve this problem assuming the base
>> file or older column files need to be scanned.
>>
>> On Wed, Jun 24, 2026 at 11:15 PM Gábor Kaszab <[email protected]>
>> wrote:
>>
>>> Hi All,
>>>
>>> I share Steven's opinion that the cons against the dense representation
>>> aren't that strong, and the implementation seems more straightforward
>>> across projects and languages, if we keep an invariant to have all the rows
>>> (even the deleted ones with auxiliary values) in the column file.
>>>
>>> 1) Off stats
>>> I can just +1 on the stats part. They can be "fixed" so that the filler
>>> values don't throw them off, but the stats are already off anyway due to
>>> deletes, so I'm not sure this is something we want to fix.
>>>
>>> 2) More data due to filler values
>>> TLDR: there is no significant difference between sparse and dense
>>> in storage size
>>>
>>> The reason is the compression efficiency for the _pos column. I made
>>> some experiments on this front and encodings can help the dense
>>> representation. The more rows we delete, the more auxiliary values we have
>>> to use with the dense representation, this is true. On the other hand, the
>>> more rows we delete the worse the compression of the _pos column is for the
>>> sparse representation (assuming Parquet V2) due to holes in the sequence.
>>> The overhead of the missing positions for sparse seems to balance out
>>> the overhead of the presence of auxiliary values for dense.
>>>
>>> 3) We have to know the record count of the base files in the writer
>>> I don't think this information is available in the writer today. We
>>> can, and in the PoC this is what we do, broadcast the "location -> record
>>> count" mapping to the writers for this.
>>>
>>> Alternatively, when scanning inputs for the writers, we can also query
>>> the '_deleted' metadata column. Using that we don't even have to broadcast
>>> the record counts.
>>>
>>> Summary:
>>> I think none of the cons for dense are deal breakers and I'm in favor of
>>> supporting a single representation. My preference is dense.
>>>
>>> Best Regards,
>>> Gabor
>>>
>>> On Wed, Jun 24, 2026 at 22:59, Steven Wu <[email protected]> wrote:
>>>
>>>> I agree with Peter's points here. While it seems flexible to have both
>>>> options, it essentially requires every engine/client to implement the more
>>>> complex read of the sparse representation.
>>>>
>>>> I want to revisit the cons that Anurag summarized for option 1
>>>> (filler values for deleted rows). To me, those arguments against filler
>>>> values seem relatively weak, and the pros (zero-copy stitching, simpler
>>>> reader implementation) outweigh the cons.
>>>>
>>>> > Filler values at deleted positions skew Parquet footer statistics
>>>> > (null_count, avg_length)
>>>>
>>>> Writers can produce accurate statistics in the Iceberg metadata even
>>>> with filler values. I know the Java reference implementation currently just
>>>> takes the column stats from the Parquet writer. But some writer
>>>> implementations may choose to produce accurate stats in this case.
>>>>
>>>> There was also a concern that differing column statistics between
>>>> Iceberg metadata and the Parquet footer, caused by DVs, could be confusing.
>>>> I want to argue that this difference is actually reasonable. A DV is a
>>>> table-level concept. With DVs, Iceberg metadata can have different and adjusted
>>>> column stats compared to the Parquet footer. Parquet is not aware of DVs,
>>>> and the Parquet footer only captures the stats for the content in the
>>>> physical file.
>>>>
>>>> Today, we already have inaccurate stats with DVs. It is not a
>>>> correctness problem, though it may have a small performance impact on pruning.
>>>> Even if writer implementations do nothing special for column files, we are
>>>> no worse off than today.
>>>>
>>>> > Writes slightly more data than necessary (filler values for deleted
>>>> > rows)
>>>>
>>>> This depends on the percentage of deleted rows. Sparse representation
>>>> also has some small overhead for storing the encoded positions (even with
>>>> delta encoding).
>>>>
>>>> > Writer must know base_file.record_count to pad trailing deletions
>>>> > (base file metadata already available during write planning)
>>>>
>>>> As already pointed out, the base file metadata already has the row
>>>> count, so it is not really a problem.
>>>>
>>>> On Tue, Jun 23, 2026 at 9:05 AM Péter Váry <[email protected]>
>>>> wrote:
>>>>
>>>>> I still have concerns with this decision:
>>>>> > Implementation Details: Specific writer implementation details such
>>>>> > as choosing between dense or sparse representations will be left to
>>>>> > individual engines.
>>>>> > Specification Scope: The specification will not mandate these
>>>>> > internal implementation choices, provided that engines adhere to writing
>>>>> > the explicit *_pos* column.
>>>>>
>>>>> If we do not specify whether the representation should be dense or
>>>>> sparse, we are effectively requiring all engines to support the sparse
>>>>> representation, since the dense representation is just a special case of
>>>>> the sparse one.
>>>>> In practice, this means every implementation must be able to
>>>>> materialize a dense representation from the sparse form, similar to what
>>>>> the current Spark implementation does today. While this is certainly
>>>>> feasible, it introduces an additional step on the read path, which is often
>>>>> performance-sensitive. This concern has been raised consistently by
>>>>> representatives of other Iceberg implementations, and I have not heard a
>>>>> different perspective from them so far.
>>>>>
>>>>> That said, if the broader group is comfortable accepting this
>>>>> trade-off, I do not have any further objections to the proposal.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Tue, Jun 16, 2026 at 20:51, Anurag Mantripragada <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> It seems this thread has become conflated with the metadata
>>>>>> representation discussion
>>>>>> <https://lists.apache.org/thread/7jryw9dfvc02s411twn4o7s5gjrybfxg>.
>>>>>> While all the points raised here are noted, let’s continue those specific
>>>>>> parts of the conversation in the metadata thread.
>>>>>>
>>>>>> Regarding data representation, we discussed the following during this
>>>>>> <https://www.youtube.com/watch?v=kuxFBm-j5hw&t=3s> sync:
>>>>>>
>>>>>>    -  Implementation Details: Specific writer implementation details
>>>>>>    such as choosing between dense or sparse representations will be left to
>>>>>>    individual engines.
>>>>>>    -  Specification Scope: The specification will not mandate these
>>>>>>    internal implementation choices, provided that engines adhere to writing
>>>>>>    the explicit *_pos* column.
>>>>>>
>>>>>> Please let me know if you have concerns.
>>>>>>
>>>>>> ~ Anurag
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 2, 2026 at 11:44 AM Xiening Dai <[email protected]> wrote:
>>>>>>
>>>>>>> We also need to think about the DV only case.
>>>>>>>
>>>>>>> If we have f0 with dv0, then we do a column update and generate f1. Do
>>>>>>> we also bump the sequence number for f0 in this case? There are multiple
>>>>>>> options:
>>>>>>>
>>>>>>> 1) We bump the sequence number, then we will need to copy dv0 into
>>>>>>> dv1 and assign the same sequence number to dv1 so that the delete positions
>>>>>>> won't get lost.
>>>>>>> 2) We don't bump the sequence number, then we don't need to rewrite
>>>>>>> dv0 and everything would keep working. But this creates a small
>>>>>>> inconsistency with the eq-delete case, and requires special-case handling
>>>>>>> on the write path.
>>>>>>> 3) We bump the sequence number for both the data file f0 and dv0. We don't
>>>>>>> need to rewrite the DV; instead we bump its sequence number as well.
>>>>>>>
>>>>>>> I'd suggest we write down these details in a spec change proposal
>>>>>>> and examine the read/write workflow carefully.
>>>>>>>
>>>>>>> On 2026/06/02 12:42:10 Gábor Kaszab wrote:
>>>>>>> > Thanks for the summary, Amogh!
>>>>>>> >
>>>>>>> > I think the missing building block to make this eq-delete rewrite work is
>>>>>>> > the decision made yesterday, to bump the base file-level sequence number
>>>>>>> > when adding a column file. With this, we can make sure that after we have
>>>>>>> > rewritten the eq-deletes into DVs in the process of adding column files, we
>>>>>>> > don't have to apply the eq-deletes we had previously on the base file.
>>>>>>> >
>>>>>>> > Just some thoughts on implementation:
>>>>>>> >
>>>>>>> >    - Write path in general: When writing the update file, we designed this
>>>>>>> >    in the PoC to receive _path and _pos from the base file. With this we can
>>>>>>> >    identify if some positions are missing and we can convert them into DVs
>>>>>>> >    - Trailing deletes: The tricky part is when trailing rows are deleted. I
>>>>>>> >    see 2 approaches to get around this:
>>>>>>> >       - Broadcast base file row counts to writers (this is done by the
>>>>>>> >       PoC): When we received the last row from the base file with pos X, but we
>>>>>>> >       know there are more rows in the base file, we have to add the trailing
>>>>>>> >       positions to the DV
>>>>>>> >       - Enrich the input rows fed to the writer with the "_deleted"
>>>>>>> >       metadata column. False => write to update file, true => write pos to DV
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > Gabor
>>>>>>> >
>>>>>>> > On Mon, Jun 1, 2026 at 22:48, Amogh Jahagirdar <[email protected]> wrote:
>>>>>>> >
>>>>>>> > > > The real challenge comes from the read path. Consider the case where we
>>>>>>> > > > have a data file f0, an equality delete file d0, a column file f1, and
>>>>>>> > > > the materialized DV d1. How do we reconcile the deletes during read? If
>>>>>>> > > > we don't do anything special, following the existing spec (based on the
>>>>>>> > > > sequence number rule), we would apply d0 on f0, and then apply d1 on f1,
>>>>>>> > > > which should still give us the correct results as both d0 and d1
>>>>>>> > > > represent the same set of positions. But this is undesired because we
>>>>>>> > > > don't want to load and re-evaluate the old column values. So we need a
>>>>>>> > > > change in the spec so that in this scenario the new d1 supersedes the
>>>>>>> > > > existing equality delete file (d0).
>>>>>>> > >
>>>>>>> > > So given the following invariants/rules:
>>>>>>> > >
>>>>>>> > > 1. In a dense representation, column updates must carry over all active
>>>>>>> > > values for the column (and there's a _pos column referencing the position
>>>>>>> > > from the original base file).
>>>>>>> > > 2. Column updates must know what rows were deleted (either to omit the row
>>>>>>> > > or materialize the default value)
>>>>>>> > > 3. Data sequence numbers are updated on column appends/updates (this would
>>>>>>> > > be a spec change in v4). I think reusing the same seq. number is key since
>>>>>>> > > we don't have a different sequence number definition that's temporal in
>>>>>>> > > dimension for delete matching and another one that's not temporal but for
>>>>>>> > > column updates. Having a single sequence number simplifies a lot of this.
>>>>>>> > > 4. The requirement that a column update must also rewrite existing
>>>>>>> > > equality deletes into DV
>>>>>>> > > I think this combination (and the fact that DVs are 1:1 with data files)
>>>>>>> > > naturally addresses this because f1 in this example would have the column
>>>>>>> > > values for all the active rows. Then the DV d1 just deletes row positions as
>>>>>>> > > usual. There's never a need to actually read the old column values in this
>>>>>>> > > model.
>>>>>>> > >
>>>>>>> > > There's a broader discussion around eliminating new equality deletes in v4
>>>>>>> > > but in that case this rule would still apply to handle older equality
>>>>>>> > > deletes from v3 and earlier + column updates on older data files as well.
>>>>>>> > >
>>>>>>> > > We actually talked about this a bit in today's V4 AMT sync
>>>>>>> > > <https://youtu.be/7mVes-6pM1c?t=861>
>>>>>>> > >
>>>>>>> > > Thanks,
>>>>>>> > > Amogh Jahagirdar
>>>>>>> > >
>>>>>>> > > On Mon, Jun 1, 2026 at 12:17 PM Xiening Dai <[email protected]> wrote:
>>>>>>> > >
>>>>>>> > >> > but we should develop some concreteness around how feasible it is for
>>>>>>> > >> > engines to produce the DVs on the column update.
>>>>>>> > >>
>>>>>>> > >> Actually I don't think this would be a problem. As mentioned, in order to
>>>>>>> > >> generate a correct column file, we already need to produce the correct set
>>>>>>> > >> of deleted positions, and we just need an extra step to materialize these
>>>>>>> > >> positions into a DV.
>>>>>>> > >>
>>>>>>> > >> The real challenge comes from the read path. Consider the case where we
>>>>>>> > >> have a data file f0, an equality delete file d0, a column file f1, and
>>>>>>> > >> the materialized DV d1. How do we reconcile the deletes during read? If
>>>>>>> > >> we don't do anything special, following the existing spec (based on the
>>>>>>> > >> sequence number rule), we would apply d0 on f0, and then apply d1 on f1,
>>>>>>> > >> which should still give us the correct results as both d0 and d1
>>>>>>> > >> represent the same set of positions. But this is undesired because we
>>>>>>> > >> don't want to load and re-evaluate the old column values. So we need a
>>>>>>> > >> change in the spec so that in this scenario the new d1 supersedes the
>>>>>>> > >> existing equality delete file (d0).
>>>>>>> > >>
>>>>>>> > >> On 2026/05/29 23:21:33 Amogh Jahagirdar wrote:
>>>>>>> > >> > One approach that’s helped me reason about all this is to treat each base
>>>>>>> > >> > file as its own little mini‑table inside the larger table: the row range of
>>>>>>> > >> > the base file keyed by row_id, and column files/deletes just layer on top.
>>>>>>> > >> > Once a row is deleted in that mini‑table, it stays deleted in that
>>>>>>> > >> > mini‑table’s state (whether that’s via equality deletes, or DVs), and column
>>>>>>> > >> > updates are just layering changed or additional columns on top of whatever
>>>>>>> > >> > rows are still there. Then I can reason about "what are desirable properties
>>>>>>> > >> > of this mini-table".
>>>>>>> > >> >
>>>>>>> > >> > Once I look at it that way, stacking equality deletes with column updates
>>>>>>> > >> > on the same column, and then forcing the write path to read all the older
>>>>>>> > >> > column files when producing new column updates, feels like the worst
>>>>>>> > >> > outcome; and it gets worse the more column updates there are for the
>>>>>>> > >> > column. It blows up complexity and performance and compromises the value of
>>>>>>> > >> > efficient column updates.
>>>>>>> > >> >
>>>>>>> > >> > If we eliminate that option, I think we’re left with two high‑level
>>>>>>> > >> > approaches:
>>>>>>> > >> >
>>>>>>> > >> >    1. Equality deletes cannot be allowed with column updates. This
>>>>>>> > >> >    simplifies both the read and write paths when column update files are
>>>>>>> > >> >    present. I would generally prefer this option but there is a legitimate
>>>>>>> > >> >    problem around the “how” for checking for the presence of equality
>>>>>>> > >> >    deletes. We can’t rely on snapshot summaries, which means we’d have to
>>>>>>> > >> >    look at delete manifests to really know if equality deletes exist. There
>>>>>>> > >> >    were ideas in the V4 AMT sync about constraining equality deletes to be
>>>>>>> > >> >    in the root manifest; in that model, the amount of work needed to check
>>>>>>> > >> >    for equality deletes is bounded by the root size. I’d keep that as a
>>>>>>> > >> >    separate open question because there are other challenges with requiring
>>>>>>> > >> >    equality deletes to only appear in the root manifest, especially on the
>>>>>>> > >> >    upgrade path.
>>>>>>> > >> >    2. After an equality delete, subsequent updates must produce a DV. As
>>>>>>> > >> >    Xiening highlighted, once you’ve had an equality delete on a column, any
>>>>>>> > >> >    subsequent updates on that column would be required to produce a DV (or
>>>>>>> > >> >    positional delete) for the deleted positions at the new sequence number,
>>>>>>> > >> >    making the original equality delete obsolete. This is attractive because
>>>>>>> > >> >    it’s not too constraining for writers: they’re already doing the work of
>>>>>>> > >> >    reconciling deleted positions to decide what to write into the column
>>>>>>> > >> >    file, so the additional work is basically emitting the DV. The main thing
>>>>>>> > >> >    to think through is how exactly the plumbing to engines looks, but in
>>>>>>> > >> >    theory it’s just a matter of plumbing through explicitly deleted
>>>>>>> > >> >    positions (or, less ideally, inferring them from a sentinel value in the
>>>>>>> > >> >    tuple).
>>>>>>> > >> >
>>>>>>> > >> >
>>>>>>> > >> > So far I’m leaning towards option 2, but we should develop some
>>>>>>> > >> > concreteness around how feasible it is for engines to produce the DVs on
>>>>>>> > >> > the column update. Again, should all be theoretically possible based off
>>>>>>> > >> > plumbing deleted positions; we shouldn't let implementations drive the spec
>>>>>>> > >> > but I think sniff testing the practicality of it is well worth it to make
>>>>>>> > >> > sure that restriction is reasonably implementable.
>>>>>>> > >> >
>>>>>>> > >> > Interested in hearing what others think about this one.
>>>>>>> > >> >
>>>>>>> > >> >
>>>>>>> > >> > Thanks,
>>>>>>> > >> >
>>>>>>> > >> > Amogh Jahagirdar
>>>>>>> > >> >
>>>>>>> > >>
>>>>>>> > >
>>>>>>> >
>>>>>>>
>>>>>>
