Re: [Discuss] Column Update File Representation

Steven Wu Thu, 25 Jun 2026 10:58:08 -0700

> We can, and in the PoC this is what we do, broadcast the "location ->
record count" mapping to the writers for this.


I am wondering if column file generation usually needs to scan the existing
base files (or column files) anyway. Otherwise, a default value column
(with expressions) should probably be sufficient. So the writer probably
already has the data file metadata.

Plus, carrying over additional contextual information (like manifest file
location and entry position) is very beneficial, as the snapshot producer
can generate manifest DVs efficiently without scanning manifest files to
locate the old TrackedFile entry to delete (maybe via manifest DV).

> Alternatively, when scanning inputs for the writers, we can also query
the '_deleted' metadata column.

I agree; this is another nice way to solve this problem assuming the base
file or older column files need to be scanned.
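
To make the two approaches concrete, here is a minimal sketch of how a writer could build a dense column file plus the deleted positions for one base file. It is an illustration only, not the PoC code: the function names, the tuple shapes, and the use of None as the filler value are all assumptions made for this example.

```python
from typing import Any, Iterable, List, Tuple


def dense_from_live_rows(
    live_rows: Iterable[Tuple[int, Any]],
    record_count: int,
    filler: Any = None,
) -> Tuple[List[Any], List[int]]:
    """Approach 1: only live rows arrive, as (pos, value) in ascending pos order.

    record_count is the base file's row count (the broadcast value). It is
    what lets the writer pad deletions that come after the last live row.
    """
    values: List[Any] = []
    deleted: List[int] = []
    next_pos = 0
    for pos, value in live_rows:
        # Any skipped position was deleted: emit a filler and record it.
        for gap in range(next_pos, pos):
            values.append(filler)
            deleted.append(gap)
        values.append(value)
        next_pos = pos + 1
    # Trailing deletes: positions after the last live row.
    for gap in range(next_pos, record_count):
        values.append(filler)
        deleted.append(gap)
    return values, deleted


def dense_from_deleted_flags(
    rows: Iterable[Tuple[int, bool, Any]],
    filler: Any = None,
) -> Tuple[List[Any], List[int]]:
    """Approach 2: every position arrives as (pos, is_deleted, value)."""
    values: List[Any] = []
    deleted: List[int] = []
    for pos, is_deleted, value in rows:
        if is_deleted:
            values.append(filler)
            deleted.append(pos)
        else:
            values.append(value)
    return values, deleted
```

In the first function the record count is the only way to learn about trailing deletes. In the second, the scan already yields every position, so no broadcast is needed.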

On Wed, Jun 24, 2026 at 11:15 PM Gábor Kaszab <[email protected]>
wrote:

> Hi All,
>
> I share Steven's opinion that the cons against the dense representation
> aren't that strong, and the implementation seems more straightforward
> across projects and languages, if we keep an invariant to have all the rows
> (even the deleted ones with auxiliary values) in the column file.
>
> 1) Off stats
> I can only +1 the stats part. The stats can be "fixed" so that filler values
> do not skew them, but they are already off anyway due to deletes, so I am
> not sure this is something we want to fix.
>
> 2) More data due to filler values
> TLDR: there is no significant difference between sparse and dense
> in storage size
>
> The reason is the compression efficiency for the _pos column. I ran some
> experiments on this front, and encodings can help the dense representation.
> The more rows we delete, the more auxiliary values we have to use with the
> dense representation, this is true. On the other hand, the more rows we
> delete the worse the compression of the _pos column is for the sparse
> representation (assuming Parquet V2) due to holes in the sequence.
> The overhead of the missing positions for sparse seems to balance out the
> overhead of the presence of auxiliary values for dense.
>
> 3) We have to know the record count of the base files in the writer
> I don't think this information is available in the writer today. We can,
> and in the PoC this is what we do, broadcast the "location -> record count"
> mapping to the writers for this.
>
> Alternatively, when scanning inputs for the writers, we can also query the
> '_deleted' metadata column. Using that we don't even have to broadcast the
> record counts.
>
> Summary:
> I think none of the cons for dense are deal breakers and I'm in favor of
> supporting a single representation. My preference is dense.
>
> Best Regards,
> Gabor
>
> Steven Wu <[email protected]> wrote (on Wed, Jun 24, 2026, 22:59):
>
>> I agree with Peter's points here. While it seems flexible to have both
>> options, it essentially requires every engine/client to implement the more
>> complex read of sparse representation.
>>
>> I want to revisit the cons that Anurag summarized for option 1
>> (filler values for deleted rows). To me, those arguments against filler
>> values seem relatively weak, and the pros (zero-copy stitching, simpler
>> reader implementation) outweigh the cons.
>>
>> > Filler values at deleted positions skew Parquet footer statistics
>> (null_count, avg_length)
>>
>> Writers can produce accurate statistics in the Iceberg metadata even with
>> filler values. I know the Java reference implementation currently just
>> takes the column stats from the Parquet writer. But some writer
>> implementations may choose to produce accurate stats in this case.
>>
>> There was also a concern that differing column statistics between Iceberg
>> metadata and the Parquet footer, caused by DVs, could be confusing. I want
>> to argue that this difference is actually reasonable.  DV is a table level
>> concept. With DVs, Iceberg metadata can have different and adjusted column
>> stats compared to the Parquet footer. Parquet is not aware of DVs, and the
>> Parquet footer only captures the stats for the content in the physical file.
>>
>> Today, we already have inaccurate stats with DVs. It is not a correctness
>> problem, though it may have a small performance impact on pruning. Even if writer
>> implementations do nothing special for column files, we are no worse off
>> than today.
>>
>> > Writes slightly more data than necessary (filler values for deleted
>> rows)
>>
>> This depends on the percentage of deleted rows. Sparse representation
>> also has some small overhead for storing the encoded positions (even with
>> delta encoding).
>>
>> > Writer must know base_file.record_count to pad trailing deletions
>> (base file metadata already available during write planning)
>>
>> As already pointed out, the base file metadata already has the row count,
>> so this is not really a problem.
>>
>> On Tue, Jun 23, 2026 at 9:05 AM Péter Váry <[email protected]>
>> wrote:
>>
>>> I still have concerns with this decision:
>>> > Implementation Details: Specific writer implementation details such as
>>> choosing between dense or sparse representations will be left to individual
>>> engines.
>>> > Specification Scope: The specification will not mandate these internal
>>> implementation choices, provided that engines adhere to writing the
>>> explicit *_pos* column.
>>>
>>> If we do not specify whether the representation should be dense or
>>> sparse, we are effectively requiring all engines to support the sparse
>>> representation, since the dense representation is just a special case of
>>> the sparse one.
>>> In practice, this means every implementation must be able to materialize
>>> a dense representation from the sparse form, similar to what the current
>>> Spark implementation does today. While this is certainly feasible, it
>>> introduces an additional step on the read path, which is often
>>> performance-sensitive. This concern has been raised consistently by
>>> representatives of other Iceberg implementations, and I have not heard a
>>> different perspective from them so far.
>>>
>>> That said, if the broader group is comfortable accepting this trade-off,
>>> I do not have any further objections to the proposal.
>>>
>>> Thanks,
>>> Peter
>>>
>>> Anurag Mantripragada <[email protected]> wrote
>>> (on Tue, Jun 16, 2026, 20:51):
>>>
>>>> Hi all,
>>>>
>>>> It seems this thread has become conflated with the metadata
>>>> representation discussion
>>>> <https://lists.apache.org/thread/7jryw9dfvc02s411twn4o7s5gjrybfxg>.
>>>> While all the points raised here are noted, let’s continue those specific
>>>> parts of the conversation in the metadata thread.
>>>>
>>>> Regarding data representation, we discussed the following during this
>>>> <https://www.youtube.com/watch?v=kuxFBm-j5hw&t=3s> sync:
>>>>
>>>>    -  Implementation Details: Specific writer implementation details
>>>>    such as choosing between dense or sparse representations will be left to
>>>>    individual engines.
>>>>    -  Specification Scope: The specification will not mandate these
>>>>    internal implementation choices, provided that engines adhere to writing
>>>>    the explicit *_pos* column.
>>>>
>>>> Please let me know if you have concerns.
>>>>
>>>> ~ Anurag
>>>>
>>>>
>>>> On Tue, Jun 2, 2026 at 11:44 AM Xiening Dai <[email protected]> wrote:
>>>>
>>>>> We also need to think about the DV only case.
>>>>>
>>>>> If we have f0 with dv0, then we do a column update and generate f1. Do
>>>>> we also bump the sequence number for f0 in this case? There are multiple
>>>>> options:
>>>>>
>>>>> 1) We bump the sequence number, then we will need to copy dv0 into dv1
>>>>> and assign the same sequence number to dv1 so that the delete positions
>>>>> won't get lost.
>>>>> 2) We don't bump the sequence number, then we don't need to rewrite
>>>>> dv0 and everything keeps working. But this creates a small
>>>>> inconsistency with the eq-delete case, and requires special-case handling
>>>>> on the write path.
>>>>> 3) We bump the sequence number for both the data file f0 and dv0. We don't
>>>>> need to rewrite the DV; instead we bump the sequence number for the DV as
>>>>> well.
>>>>>
>>>>> I'd suggest we write these details down in a spec change proposal
>>>>> and examine the read/write workflow carefully.
>>>>>
>>>>> On 2026/06/02 12:42:10 Gábor Kaszab wrote:
>>>>> > Thanks for the summary, Amogh!
>>>>> >
>>>>> > I think the missing building block to make this eq-delete rewrite
>>>>> work is
>>>>> > the decision made yesterday, to bump the base file-level sequence
>>>>> number
>>>>> > when adding a column file. With this, we can make sure that after we
>>>>> have
>>>>> > rewritten the eq-deletes into DVs in the process of adding column
>>>>> files, we
>>>>> > don't have to apply the eq-deletes we had previously on the base
>>>>> file.
>>>>> >
>>>>> > Just some thoughts on implementation:
>>>>> >
>>>>> >    - Write path in general: When writing the update file, we
>>>>> designed this
>>>>> >    in the PoC to receive _path and _pos from the base file. With
>>>>> this we can
>>>>> >    identify if some positions are missing and we can convert them
>>>>> into DVs
>>>>> >    - Trailing deletes: The tricky part is when trailing rows are
>>>>> deleted. I
>>>>> >    see 2 approaches to get around this:
>>>>> >       - Broadcast base file row counts to writers (this is done by
>>>>> the
>>>>> >       PoC): When we received the last row from the base file with
>>>>> pos X, but we
>>>>> >       know there are more rows in the base file, we have to add the
>>>>> trailing
>>>>> >       positions to the DV
>>>>> >       - Enrich the input rows fed to the writer with the "_deleted"
>>>>> >       metadata column. False => write to update file, true => write
>>>>> pos to DV
>>>>> >
>>>>> > Regards,
>>>>> > Gabor
>>>>> >
>>>>> > Amogh Jahagirdar <[email protected]> wrote (on Mon, Jun 1, 2026,
>>>>> > 22:48):
>>>>> >
>>>>> > > >The real challenge comes from the read path. In the case when we
>>>>> have a
>>>>> > > data file f0, an equality delete file d0, and column file f1, and
>>>>> the
>>>>> > > materialized dv d1. How do we reconcile the deletes during read?
>>>>> If we
>>>>> > > don't do anything special, following the existing spec (based on
>>>>> sequence
>>>>> > > number rule), we would apply d0 on f0, and then apply d1 on f1,
>>>>> which
>>>>> > > should still give us the correct results as both d0 and d1
>>>>> represent the
>>>>> > > same set of positions. But this is undesired because we don't want
>>>>> to load
>>>>> > > and re-evaluate the old column values. So we need a change in the
>>>>> spec so
>>>>> > > that in this scenario the new d1 supersedes the existing equality
>>>>> delete
>>>>> > > file (d0).
>>>>> > >
>>>>> > > So given the following invariants/rules:
>>>>> > >
>>>>> > > 1. In a dense representation, column updates must carry over all
>>>>> active
>>>>> > > values for the column (and there's a _pos column referencing the
>>>>> position
>>>>> > > from the original base file).
>>>>> > > 2. Column updates must know what rows were deleted (either to omit
>>>>> the row
>>>>> > > or materialize the default value)
>>>>> > > 3. Data sequence numbers are updated on column appends/updates
>>>>> (this would
>>>>> > > be a spec change in v4). I think reusing the same seq. number is
>>>>> key since
>>>>> > > we don't have a different sequence number definition that's
>>>>> temporal in
>>>>> > > dimension for delete matching and another one that's not temporal
>>>>> but for
>>>>> > > column updates. Having a single sequence number simplifies a lot
>>>>> of this.
>>>>> > > 4. The requirement that a column update must also rewrite existing
>>>>> > > equality deletes into DV
>>>>> > >
>>>>> > > I think this combination (and the fact that DVs are 1:1 with
>>>>> data
>>>>> > > files) naturally addresses this because
>>>>> > > f1 in this example would have the column values for all the active
>>>>> rows.
>>>>> > > Then the DV d1 just deletes row positions as usual. There's never
>>>>> a need to
>>>>> > > actually read the old column values in this model.
>>>>> > >
>>>>> > > There's a broader discussion around eliminating new equality
>>>>> deletes in v4
>>>>> > > but in that case this rule would still apply to handle older
>>>>> equality
>>>>> > > deletes from v3 and earlier + column updates on older data files
>>>>> as well.
>>>>> > >
>>>>> > > We actually talked about this a bit in today's V4 AMT sync
>>>>> > > <https://youtu.be/7mVes-6pM1c?t=861>
>>>>> > >
>>>>> > > Thanks,
>>>>> > > Amogh Jahagirdar
>>>>> > >
>>>>> > > On Mon, Jun 1, 2026 at 12:17 PM Xiening Dai <[email protected]>
>>>>> wrote:
>>>>> > >
>>>>> > >> > but we should develop some concreteness around how feasible it
>>>>> is for
>>>>> > >> engines to produce the DVs on the column update.
>>>>> > >>
>>>>> > >> Actually I don't think this would be a problem. As mentioned, in
>>>>> order to
>>>>> > >> generate a correct column file, we already need to produce the
>>>>> correct set of
>>>>> > >> deleted positions, and we just need an extra step to materialize
>>>>> these
>>>>> > >> positions into a DV.
>>>>> > >>
>>>>> > >> The real challenge comes from the read path. In the case when we
>>>>> have a
>>>>> > >> data file f0, an equality delete file d0, and column file f1, and
>>>>> the
>>>>> > >> materialized dv d1. How do we reconcile the deletes during read?
>>>>> If we
>>>>> > >> don't do anything special, following the existing spec (based on
>>>>> sequence
>>>>> > >> number rule), we would apply d0 on f0, and then apply d1 on f1,
>>>>> which
>>>>> > >> should still give us the correct results as both d0 and d1
>>>>> represent the
>>>>> > >> same set of positions. But this is undesired because we don't want
>>>>> to load
>>>>> > >> and re-evaluate the old column values. So we need a change in the
>>>>> spec so
>>>>> > >> that in this scenario the new d1 supersedes the existing equality
>>>>> delete
>>>>> > >> file (d0).
>>>>> > >>
>>>>> > >> On 2026/05/29 23:21:33 Amogh Jahagirdar wrote:
>>>>> > >> > One approach that’s helped me reason about all this is to treat
>>>>> each
>>>>> > >> base
>>>>> > >> > file as its own little mini‑table inside the larger table: the
>>>>> row
>>>>> > >> range of
>>>>> > >> > the base file keyed by row_id, and column files/deletes just
>>>>> layer on
>>>>> > >> top. Once
>>>>> > >> > a row is deleted in that mini‑table, it stays deleted in that
>>>>> > >> mini‑table’s
>>>>> > >> > state (whether that’s via equality deletes, or DVs), and column
>>>>> updates
>>>>> > >> are
>>>>> > >> > just layering changed or additional columns on top of whatever
>>>>> rows are
>>>>> > >> > still there. Then I can reason about "what are desirable
>>>>> properties of
>>>>> > >> this
>>>>> > >> > mini-table".
>>>>> > >> >
>>>>> > >> > Once I look at it that way, stacking equality deletes with
>>>>> column
>>>>> > >> updates
>>>>> > >> > on the same column, and then forcing the write path to read all
>>>>> the
>>>>> > >> older
>>>>> > >> > column files when producing new column updates, feels like the
>>>>> worst
>>>>> > >> > outcome; and it gets worse the more column updates there are
>>>>> for the
>>>>> > >> > column. It blows up complexity and performance and compromises
>>>>> the
>>>>> > >> value of
>>>>> > >> > efficient column updates.
>>>>> > >> >
>>>>> > >> > If we eliminate that option, I think we’re left with two
>>>>> high‑level
>>>>> > >> > approaches:
>>>>> > >> >
>>>>> > >> >    1. Equality deletes cannot be allowed with column updates.
>>>>> This
>>>>> > >> >    simplifies both the read and write paths when column update
>>>>> files are
>>>>> > >> >    present. I would generally prefer this option but there is a
>>>>> > >> legitimate
>>>>> > >> >    problem around the “how” for checking for the presence of
>>>>> equality
>>>>> > >> deletes. We
>>>>> > >> >    can’t rely on snapshot summaries, which means we’d have to
>>>>> look at
>>>>> > >> delete
>>>>> > >> >    manifests to really know if equality deletes exist. There
>>>>> were ideas
>>>>> > >> in the
>>>>> > >> >    V4 AMT sync about constraining equality deletes to be in the
>>>>> root
>>>>> > >> manifest;
>>>>> > >> >    in that model, the amount of work needed to check for
>>>>> equality
>>>>> > >> deletes is
>>>>> > >> >    bounded by the root size. I’d keep that as a separate open
>>>>> question
>>>>> > >> because
>>>>> > >> >    there are other challenges with requiring equality deletes
>>>>> to only
>>>>> > >> appear
>>>>> > >> >    in the root manifest, especially on the upgrade path.
>>>>> > >> >    2. After an equality delete, subsequent updates must produce
>>>>> a DV. As
>>>>> > >> >    Xiening highlighted, once you’ve had an equality delete on a
>>>>> column,
>>>>> > >> any
>>>>> > >> >    subsequent updates on that column would be required to
>>>>> produce a DV
>>>>> > >> (or
>>>>> > >> >    positional delete) for the deleted positions at the new
>>>>> sequence
>>>>> > >> number,
>>>>> > >> >    making the original equality delete obsolete. This is
>>>>> attractive
>>>>> > >> because
>>>>> > >> >    it’s not too constraining for writers: they’re already doing
>>>>> the
>>>>> > >> work of
>>>>> > >> >    reconciling deleted positions to decide what to write into
>>>>> the
>>>>> > >> column file,
>>>>> > >> >    so the additional work is basically emitting the DV. The
>>>>> main thing
>>>>> > >> to
>>>>> > >> >    think through is how exactly the plumbing to engines looks,
>>>>> but in
>>>>> > >> theory
>>>>> > >> >    it’s just a matter of plumbing through explicitly deleted
>>>>> positions
>>>>> > >> (or,
>>>>> > >> >    less ideally, inferring them from a sentinel value in the
>>>>> tuple).
>>>>> > >> >
>>>>> > >> >
>>>>> > >> > So far I’m leaning towards option 2, but we should develop some
>>>>> > >> > concreteness around how feasible it is for engines to produce
>>>>> the DVs on
>>>>> > >> > the column update. Again, this should all be theoretically possible
>>>>> based off
>>>>> > >> > plumbing deleted positions; we shouldn't let implementations
>>>>> drive the
>>>>> > >> spec
>>>>> > >> > but I think sniff testing the practicality of it is well worth
>>>>> it to
>>>>> > >> make
>>>>> > >> > sure that restriction is reasonably implementable.
>>>>> > >> >
>>>>> > >> > Interested in hearing what others think about this one.
>>>>> > >> >
>>>>> > >> >
>>>>> > >> > Thanks,
>>>>> > >> >
>>>>> > >> > Amogh Jahagirdar
>>>>> > >> >
>>>>> > >>
>>>>> > >
>>>>> >
>>>>>
>>>>
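
Peter's read-path concern above (a sparse column file has to be materialized into a dense one before it can be stitched to the base file) can be illustrated with a small sketch. This is not taken from any engine: the function names, the list-based layout, and None as the filler are assumptions made for this example.

```python
from typing import Any, List, Sequence, Tuple


def materialize_dense(
    positions: Sequence[int],
    values: Sequence[Any],
    record_count: int,
    filler: Any = None,
) -> List[Any]:
    """Sparse read: expand (_pos, value) pairs into one slot per base row.

    This is the extra step every reader needs if sparse files are allowed.
    """
    dense: List[Any] = [filler] * record_count
    for pos, value in zip(positions, values):
        dense[pos] = value
    return dense


def stitch(
    base_rows: Sequence[Tuple[Any, ...]],
    column_values: Sequence[Any],
) -> List[Tuple[Any, ...]]:
    """Dense read: row i of the column file lines up with row i of the base file."""
    return [row + (value,) for row, value in zip(base_rows, column_values)]
```

With a dense file the reader can call something like stitch directly. With a sparse file it first needs the equivalent of materialize_dense, which also requires the base file's record count.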
