Re: [Discuss] Column Update File Representation

Anurag Mantripragada Thu, 02 Jul 2026 13:44:10 -0700

Hi all,

Thanks for your input and for the discussion during this week’s sync meeting
<https://youtu.be/PGx4-GKqm6c?si=m95DTzUnvCbS39HY>. Following up on that
call, we have reached the following decisions:


*Column File Representation*
We agreed to mandate a column file representation that includes values for
deleted rows to align with the base file. We also decided to use NULL
values for these deleted positions. As a result, every leaf column in a
column update will be nullable, even if it was non-nullable in the base
file.
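
As a rough illustration only (not taken from the PR or the spec), the dense
layout and the positional read it enables could look like the sketch below.
The function names, the in-memory list standing in for a column file, and the
set of deleted positions are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence, Set


def write_dense_column(
    base_record_count: int,
    new_values: Dict[int, object],
    deleted_positions: Set[int],
) -> List[Optional[object]]:
    """Hypothetical writer: one entry per base-file row, in base-file order.

    Deleted positions (including trailing ones) are filled with None (NULL),
    so the column file always has exactly base_record_count entries.
    """
    column: List[Optional[object]] = []
    for pos in range(base_record_count):
        if pos in deleted_positions:
            column.append(None)  # NULL filler keeps positions aligned
        else:
            column.append(new_values[pos])
    return column


def stitch(
    base_rows: Sequence[dict],
    column_name: str,
    dense_column: Sequence[Optional[object]],
    deleted_positions: Set[int],
) -> List[dict]:
    """Hypothetical reader: positional substitution, then apply deletes."""
    if len(dense_column) != len(base_rows):
        raise ValueError("dense column file must align with the base file")
    return [
        {**row, column_name: value}
        for pos, (row, value) in enumerate(zip(base_rows, dense_column))
        if pos not in deleted_positions
    ]
```

Because the filler is NULL, the updated column has to be readable as nullable
even when the base column was declared non-nullable, which is the consequence
noted above.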

*Updates on Partition Fields*
The partition tuple in the file metadata must always match the file's
content. Updates are permitted, including moving a file to an unpartitioned
state by unsetting its tuple.

*Next Steps*

   - [Anurag]: Document the dense format and filler strategy requirements,
   along with recommendations for updating states in the spec.
   - [All]: Please review the initial PR
   <https://github.com/apache/iceberg/pull/16285> for column file structs.


~ Anurag Mantripragada

On Sun, Jun 28, 2026 at 10:03 PM Anurag Mantripragada <
[email protected]> wrote:

> Hi all,
>
> The arguments for mandating the dense (positional) representation are
> compelling, and I'm convinced. I'll add it as an agenda item for the next
> sync to formally confirm and close this. If anyone has remaining objections,
> please raise them here or during the upcoming sync.
>
> Thanks for all the input.
>
> ~ Anurag
>
> On Thu, Jun 25, 2026 at 3:23 PM Andrei Tserakhau via dev <
> [email protected]> wrote:
>
>> > There is a discussion about column families that has been punted for
>> later. Separate column files for column families can be a desired and
>> optimal long-term state.
>> Agree here.
>>
>> With column families, separate column files become an intentional
>> long-term layout, not a transient overlay that compaction folds away.
>>
>> I'd treat column families as a separate effort, and I think it's the
>> effort that actually motivates sparse. The column-update use case is
>> whole-column refresh - that's clearly dense. The regime where sparse earns
>> its place is long-lived files updated repeatedly in part, because there
>> full-coverage rewrites get wasteful - and that regime is exactly column
>> families.
>>
>> So the representation can track the feature instead of being decided up
>> front for both:
>> - column updates: whole-column refresh -> dense.
>> - column families: persistent separate files, repeated partial updates ->
>> where sparse earns its place, decided as part of that design.
>>
>> That keeps today's reader simple (positional, dense only), and it gives
>> sparse a concrete trigger - it comes in with column families, as its own
>> format update - instead of an open-ended "maybe later" that forces every
>> reader to support it now.
>>
>> So still +1 on starting dense. I'd just frame sparse as part of the
>> column families effort when we pick that back up, rather than a
>> representation choice we have to settle today.
>>
>> Best,
>> Andrei
>>
>> On Fri, Jun 26, 2026 at 12:08 AM Steven Wu <[email protected]> wrote:
>>
>>> > because column files are short-lived. Compaction rewrites them back
>>> into the base files regularly, so there is no long-lived dense corpus
>>> to migrate.
>>>
>>> I don't necessarily agree that column files are short lived. There is a
>>> discussion about column families that has been punted for later. Separate
>>> column files for column families can be a desired and optimal long-term
>>> state.
>>>
>>> I would also favor starting with the dense representation (filler
>>> values for deleted rows).
>>>
>>>
>>> On Thu, Jun 25, 2026 at 2:58 PM Andrei Tserakhau via dev <
>>> [email protected]> wrote:
>>>
>>>> +1 on picking dense as the single representation now, rather than
>>>> leaving it up to the engine.
>>>>
>>>> The reason I'd mandate it, not just allow it, is the asymmetry. Dense
>>>> is the special case of sparse, so mandating the special case is the
>>>> smallest thing every reader has to implement: a positional substitution, no
>>>> scatter, no merge-on-read stacks. And it covers the dominant workload
>>>> directly - refreshing a whole column (a new column from an expression, or
>>>> overwriting an existing column with new values like embeddings or model
>>>> weights) is full-coverage by nature, not a point update to a few rows.
>>>>
>>>> Key thing here is that going dense now does not close the door on
>>>> sparse.
>>>>
>>>> A sparse-capable reader is a superset of a dense one - it can read
>>>> dense files too, since full coverage is just sparse with every position
>>>> present. So adding sparse later is an additive format version: it widens
>>>> the reader, it does not break existing dense files, and tables that never
>>>> need sparse never pay for it. The reverse is not true. If we allow sparse
>>>> now, every engine and client has to implement the harder merge-on-read path
>>>> from day one, for row-level partial updates the current workloads are not
>>>> asking for.
>>>>
>>>> On whether dense is a one-way door: I don't think it is, because column
>>>> files are short-lived. Compaction rewrites them back into the base files
>>>> regularly, so there is no long-lived dense corpus to migrate. If we add
>>>> sparse later, old dense files age out through normal compaction, or we
>>>> rewrite them - and because they are transient, that cost is bounded and
>>>> amortized, not a table-wide migration.
>>>>
>>>> So my preference is to specify dense now and keep sparse as a
>>>> documented future extension with its own format version, rather than
>>>> leaving the representation unspecified. Leaving it open is the worst of the
>>>> three: as Peter pointed out, it forces every reader to support sparse
>>>> anyway, which is the exact cost we would be trying to defer.
>>>>
>>>> Best,
>>>> Andrei
>>>>
>>>> On Thu, Jun 25, 2026 at 7:58 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> > We can, and in the PoC this is what we do, broadcast the "location
>>>>> -> record count" mapping to the writers for this.
>>>>>
>>>>> I am wondering if column file generation usually needs to scan the
>>>>> existing base files (or column files) anyway. Otherwise, a default value
>>>>> column (with expressions) should probably be sufficient. So the writer
>>>>> probably already has the data file metadata.
>>>>>
>>>>> Plus, carrying over additional contextual information (like manifest
>>>>> file location and entry position) is very beneficial, as the snapshot
>>>>> producer can generate manifest DVs efficiently without scanning manifest
>>>>> files to locate the old TrackedFile entry to delete (maybe via manifest 
>>>>> DV).
>>>>>
>>>>> > Alternatively, when scanning inputs for the writers, we can also
>>>>> query the '_deleted' metadata column.
>>>>>
>>>>> I agree; this is another nice way to solve this problem assuming the
>>>>> base file or older column files need to be scanned.
>>>>>
>>>>> On Wed, Jun 24, 2026 at 11:15 PM Gábor Kaszab <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I share Steven's opinion that the cons against the dense
>>>>>> representation aren't that strong, and the implementation seems more
>>>>>> straightforward across projects and languages, if we keep an invariant to
>>>>>> have all the rows (even the deleted ones with auxiliary values) in the
>>>>>> column file.
>>>>>>
>>>>>> 1) Off stats
>>>>>> I can just +1 on the stats part. They can be "fixed" so that the
>>>>>> filler values don't skew them, but the stats are already off anyway due
>>>>>> to deletes, so I'm not sure this is something we want to fix.
>>>>>>
>>>>>> 2) More data due to filler values
>>>>>> TLDR: there is no significant difference between sparse and dense
>>>>>> in storage size
>>>>>>
>>>>>> The reason is the compression efficiency for the _pos column. I ran
>>>>>> some experiments on this front, and encodings can help the dense
>>>>>> representation. It is true that the more rows we delete, the more
>>>>>> auxiliary values the dense representation has to carry. On the other
>>>>>> hand, the more rows we delete, the worse the _pos column compresses in
>>>>>> the sparse representation (assuming Parquet V2) due to holes in the
>>>>>> sequence. The overhead of the missing positions for sparse seems to
>>>>>> balance out the overhead of the auxiliary values for dense.
>>>>>>
>>>>>> 3) We have to know the record count of the base files in the writer
>>>>>> I don't think this information is currently available in the writer. We
>>>>>> can, and in the PoC this is what we do, broadcast the "location -> record
>>>>>> count" mapping to the writers for this.
>>>>>>
>>>>>> Alternatively, when scanning inputs for the writers, we can also
>>>>>> query the '_deleted' metadata column. Using that we don't even have to
>>>>>> broadcast the record counts.
>>>>>>
>>>>>> Summary:
>>>>>> I think none of the cons for dense are deal breakers and I'm in favor
>>>>>> of supporting a single representation. My preference is dense.
>>>>>>
>>>>>> Best Regards,
>>>>>> Gabor
>>>>>>
>>>>>> On Wed, Jun 24, 2026 at 10:59 PM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I agree with Peter's points here. While it seems flexible to have
>>>>>>> both options, it essentially requires every engine/client to implement
>>>>>>> the more complex read path for the sparse representation.
>>>>>>>
>>>>>>> I want to revisit the cons that Anurag summarized for the option 1
>>>>>>> (filler values for deleted rows). To me, those arguments against filler
>>>>>>> values seem relatively weak, and the pros (zero-copy stitching, simpler
>>>>>>> reader implementation) outweigh the cons.
>>>>>>>
>>>>>>> > Filler values at deleted positions skew Parquet footer statistics
>>>>>>> (null_count, avg_length)
>>>>>>>
>>>>>>> Writers can produce accurate statistics in the Iceberg metadata even
>>>>>>> with filler values. I know the Java reference implementation currently
>>>>>>> just takes the column stats from the Parquet writer. But some writer
>>>>>>> implementations may choose to produce accurate stats in this case.
>>>>>>>
>>>>>>> There was also a concern that differing column statistics between
>>>>>>> Iceberg metadata and the Parquet footer, caused by DVs, could be
>>>>>>> confusing. I want to argue that this difference is actually reasonable.
>>>>>>> DVs are a table-level concept. With DVs, Iceberg metadata can have
>>>>>>> different, adjusted column stats compared to the Parquet footer. Parquet
>>>>>>> is not aware of DVs, and the Parquet footer only captures the stats for
>>>>>>> the content in the physical file.
>>>>>>>
>>>>>>> Today, we already have inaccurate stats with DVs. It is not a
>>>>>>> correctness problem; it may have a small performance impact on pruning.
>>>>>>> Even if writer implementations do nothing special for column files, we
>>>>>>> are no worse off than today.
>>>>>>>
>>>>>>> > Writes slightly more data than necessary (filler values for
>>>>>>> deleted rows)
>>>>>>>
>>>>>>> This depends on the percentage of deleted rows. Sparse
>>>>>>> representation also has some small overhead for storing the encoded
>>>>>>> positions (even with delta encoding).
>>>>>>>
>>>>>>> > Writer must know base_file.record_count to pad trailing deletions
>>>>>>> (base file metadata already available during write planning)
>>>>>>>
>>>>>>> As already pointed out, the base file metadata already has the row
>>>>>>> count, so it is not really a problem.
>>>>>>>
>>>>>>> On Tue, Jun 23, 2026 at 9:05 AM Péter Váry <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I still have concerns with this decision:
>>>>>>>> > Implementation Details: Specific writer implementation details
>>>>>>>> such as choosing between dense or sparse representations will be left
>>>>>>>> to individual engines.
>>>>>>>> > Specification Scope: The specification will not mandate these
>>>>>>>> internal implementation choices, provided that engines adhere to
>>>>>>>> writing the explicit *_pos* column.
>>>>>>>>
>>>>>>>> If we do not specify whether the representation should be dense or
>>>>>>>> sparse, we are effectively requiring all engines to support the sparse
>>>>>>>> representation, since the dense representation is just a special case
>>>>>>>> of the sparse one.
>>>>>>>> In practice, this means every implementation must be able to
>>>>>>>> materialize a dense representation from the sparse form, similar to
>>>>>>>> what the current Spark implementation does today. While this is
>>>>>>>> certainly feasible, it introduces an additional step on the read path,
>>>>>>>> which is often performance-sensitive. This concern has been raised
>>>>>>>> consistently by representatives of other Iceberg implementations, and I
>>>>>>>> have not heard a different perspective from them so far.
>>>>>>>>
>>>>>>>> That said, if the broader group is comfortable accepting this
>>>>>>>> trade-off, I do not have any further objections to the proposal.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> On Tue, Jun 16, 2026 at 8:51 PM Anurag Mantripragada
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> It seems this thread has become conflated with the metadata
>>>>>>>>> representation discussion
>>>>>>>>> <https://lists.apache.org/thread/7jryw9dfvc02s411twn4o7s5gjrybfxg>.
>>>>>>>>> While all the points raised here are noted, let’s continue those
>>>>>>>>> specific parts of the conversation in the metadata thread.
>>>>>>>>>
>>>>>>>>> Regarding data representation, we discussed the following during
>>>>>>>>> this <https://www.youtube.com/watch?v=kuxFBm-j5hw&t=3s> sync:
>>>>>>>>>
>>>>>>>>>    -  Implementation Details: Specific writer implementation
>>>>>>>>>    details such as choosing between dense or sparse representations
>>>>>>>>>    will be left to individual engines.
>>>>>>>>>    -  Specification Scope: The specification will not mandate
>>>>>>>>>    these internal implementation choices, provided that engines
>>>>>>>>>    adhere to writing the explicit *_pos* column.
>>>>>>>>>
>>>>>>>>> Please let me know if you have concerns.
>>>>>>>>>
>>>>>>>>> ~ Anurag
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 2, 2026 at 11:44 AM Xiening Dai <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> We also need to think about the DV only case.
>>>>>>>>>>
>>>>>>>>>> If we have f0 with dv0, then we do a column update and generate f1.
>>>>>>>>>> Do we also bump the sequence number for f0 in this case? There are
>>>>>>>>>> multiple options:
>>>>>>>>>>
>>>>>>>>>> 1) We bump the sequence number. Then we need to copy dv0 into dv1
>>>>>>>>>> and assign the same sequence number to dv1 so that the delete
>>>>>>>>>> positions don't get lost.
>>>>>>>>>> 2) We don't bump the sequence number. Then we don't need to rewrite
>>>>>>>>>> dv0 and everything keeps working. But this creates a small
>>>>>>>>>> inconsistency with the eq-delete case and requires special-case
>>>>>>>>>> handling on the write path.
>>>>>>>>>> 3) We bump the sequence number for both the data file f0 and dv0. We
>>>>>>>>>> don't need to rewrite the DV; instead we bump its sequence number as
>>>>>>>>>> well.
>>>>>>>>>>
>>>>>>>>>> I'd suggest we write down these details in a spec change
>>>>>>>>>> proposal and examine the read/write workflow carefully.
>>>>>>>>>>
>>>>>>>>>> On 2026/06/02 12:42:10 Gábor Kaszab wrote:
>>>>>>>>>> > Thanks for the summary, Amogh!
>>>>>>>>>> >
>>>>>>>>>> > I think the missing building block to make this eq-delete rewrite
>>>>>>>>>> > work is the decision made yesterday, to bump the base file-level
>>>>>>>>>> > sequence number when adding a column file. With this, we can make
>>>>>>>>>> > sure that after we have rewritten the eq-deletes into DVs in the
>>>>>>>>>> > process of adding column files, we don't have to apply the eq-deletes
>>>>>>>>>> > we had previously on the base file.
>>>>>>>>>> >
>>>>>>>>>> > Just some thoughts on implementation:
>>>>>>>>>> >
>>>>>>>>>> >    - Write path in general: When writing the update file, we designed
>>>>>>>>>> >    this in the PoC to receive _path and _pos from the base file. With
>>>>>>>>>> >    this we can identify whether some positions are missing and convert
>>>>>>>>>> >    them into DVs.
>>>>>>>>>> >    - Trailing deletes: The tricky part is when trailing rows are
>>>>>>>>>> >    deleted. I see 2 approaches to get around this:
>>>>>>>>>> >       - Broadcast base file row counts to writers (this is done by the
>>>>>>>>>> >       PoC): When we have received the last row from the base file with
>>>>>>>>>> >       pos X, but we know there are more rows in the base file, we have
>>>>>>>>>> >       to add the trailing positions to the DV.
>>>>>>>>>> >       - Enrich the input rows fed to the writer with the "_deleted"
>>>>>>>>>> >       metadata column. False => write to update file, true => write pos
>>>>>>>>>> >       to DV.
>>>>>>>>>> >
>>>>>>>>>> > Regards,
>>>>>>>>>> > Gabor
>>>>>>>>>> >
>>>>>>>>>> > On Mon, Jun 1, 2026 at 10:48 PM Amogh Jahagirdar <[email protected]>
>>>>>>>>>> > wrote:
>>>>>>>>>> >
>>>>>>>>>> > > > The real challenge comes from the read path. Take the case where we
>>>>>>>>>> > > > have a data file f0, an equality delete file d0, a column file f1,
>>>>>>>>>> > > > and the materialized DV d1. How do we reconcile the deletes during
>>>>>>>>>> > > > read? If we don't do anything special, following the existing spec
>>>>>>>>>> > > > (based on the sequence number rule), we would apply d0 on f0, and
>>>>>>>>>> > > > then apply d1 on f1, which should still give us the correct results
>>>>>>>>>> > > > as both d0 and d1 represent the same set of positions. But this is
>>>>>>>>>> > > > undesired because we don't want to load and re-evaluate the old
>>>>>>>>>> > > > column values. So we need a change in the spec so that in this
>>>>>>>>>> > > > scenario the new d1 supersedes the existing equality delete file
>>>>>>>>>> > > > (d0).
>>>>>>>>>> > >
>>>>>>>>>> > > So given the following invariants/rules:
>>>>>>>>>> > >
>>>>>>>>>> > > 1. In a dense representation, column updates must carry over all
>>>>>>>>>> > > active values for the column (and there's a _pos column referencing
>>>>>>>>>> > > the position from the original base file).
>>>>>>>>>> > > 2. Column updates must know what rows were deleted (either to omit
>>>>>>>>>> > > the row or materialize the default value)
>>>>>>>>>> > > 3. Data sequence numbers are updated on column appends/updates (this
>>>>>>>>>> > > would be a spec change in v4). I think reusing the same seq. number
>>>>>>>>>> > > is key since we don't have a different sequence number definition
>>>>>>>>>> > > that's temporal in dimension for delete matching and another one
>>>>>>>>>> > > that's not temporal but for column updates. Having a single sequence
>>>>>>>>>> > > number simplifies a lot of this.
>>>>>>>>>> > > 4. The requirement that a column update must also rewrite existing
>>>>>>>>>> > > equality deletes into DV
>>>>>>>>>> > >
>>>>>>>>>> > > I think this combination (and the fact that DVs are 1:1 with data
>>>>>>>>>> > > files) naturally addresses this, because f1 in this example would
>>>>>>>>>> > > have the column values for all the active rows. Then the DV d1 just
>>>>>>>>>> > > deletes row positions as usual. There's never a need to actually
>>>>>>>>>> > > read the old column values in this model.
>>>>>>>>>> > >
>>>>>>>>>> > > There's a broader discussion around eliminating new equality deletes
>>>>>>>>>> > > in v4, but in that case this rule would still apply to handle older
>>>>>>>>>> > > equality deletes from v3 and earlier + column updates on older data
>>>>>>>>>> > > files as well.
>>>>>>>>>> > >
>>>>>>>>>> > > We actually talked about this a bit in today's V4 AMT sync
>>>>>>>>>> > > <https://youtu.be/7mVes-6pM1c?t=861>
>>>>>>>>>> > >
>>>>>>>>>> > > Thanks,
>>>>>>>>>> > > Amogh Jahagirdar
>>>>>>>>>> > >
>>>>>>>>>> > > On Mon, Jun 1, 2026 at 12:17 PM Xiening Dai <[email protected]>
>>>>>>>>>> > > wrote:
>>>>>>>>>> > >
>>>>>>>>>> > >> > but we should develop some concreteness around how feasible it is
>>>>>>>>>> > >> > for engines to produce the DVs on the column update.
>>>>>>>>>> > >>
>>>>>>>>>> > >> Actually I don't think this would be a problem. As mentioned, in
>>>>>>>>>> > >> order to generate a correct column file, we already need to produce
>>>>>>>>>> > >> the correct set of deleted positions, and we just need an extra step
>>>>>>>>>> > >> to materialize these positions into a DV.
>>>>>>>>>> > >>
>>>>>>>>>> > >> The real challenge comes from the read path. Take the case where we
>>>>>>>>>> > >> have a data file f0, an equality delete file d0, a column file f1,
>>>>>>>>>> > >> and the materialized DV d1. How do we reconcile the deletes during
>>>>>>>>>> > >> read? If we don't do anything special, following the existing spec
>>>>>>>>>> > >> (based on the sequence number rule), we would apply d0 on f0, and
>>>>>>>>>> > >> then apply d1 on f1, which should still give us the correct results
>>>>>>>>>> > >> as both d0 and d1 represent the same set of positions. But this is
>>>>>>>>>> > >> undesired because we don't want to load and re-evaluate the old
>>>>>>>>>> > >> column values. So we need a change in the spec so that in this
>>>>>>>>>> > >> scenario the new d1 supersedes the existing equality delete file
>>>>>>>>>> > >> (d0).
>>>>>>>>>> > >>
>>>>>>>>>> > >> On 2026/05/29 23:21:33 Amogh Jahagirdar wrote:
>>>>>>>>>> > >> > One approach that’s helped me reason about all this is to treat
>>>>>>>>>> > >> > each base file as its own little mini‑table inside the larger
>>>>>>>>>> > >> > table: the row range of the base file keyed by row_id, and column
>>>>>>>>>> > >> > files/deletes just layer on top. Once a row is deleted in that
>>>>>>>>>> > >> > mini‑table, it stays deleted in that mini‑table’s state (whether
>>>>>>>>>> > >> > that’s via equality deletes or DVs), and column updates are just
>>>>>>>>>> > >> > layering changed or additional columns on top of whatever rows are
>>>>>>>>>> > >> > still there. Then I can reason about "what are desirable
>>>>>>>>>> > >> > properties of this mini-table".
>>>>>>>>>> > >> >
>>>>>>>>>> > >> > Once I look at it that way, stacking equality deletes with column
>>>>>>>>>> > >> > updates on the same column, and then forcing the write path to
>>>>>>>>>> > >> > read all the older column files when producing new column updates,
>>>>>>>>>> > >> > feels like the worst outcome; and it gets worse the more column
>>>>>>>>>> > >> > updates there are for the column. It blows up complexity and
>>>>>>>>>> > >> > performance and compromises the value of efficient column updates.
>>>>>>>>>> > >> >
>>>>>>>>>> > >> > If we eliminate that option, I think we’re left with two
>>>>>>>>>> > >> > high‑level approaches:
>>>>>>>>>> > >> >
>>>>>>>>>> > >> >    1. Equality deletes cannot be allowed with column updates. This
>>>>>>>>>> > >> >    simplifies both the read and write paths when column update
>>>>>>>>>> > >> >    files are present. I would generally prefer this option but
>>>>>>>>>> > >> >    there is a legitimate problem around the “how” for checking for
>>>>>>>>>> > >> >    the presence of equality deletes. We can’t rely on snapshot
>>>>>>>>>> > >> >    summaries, which means we’d have to look at delete manifests to
>>>>>>>>>> > >> >    really know if equality deletes exist. There were ideas in the
>>>>>>>>>> > >> >    V4 AMT sync about constraining equality deletes to be in the
>>>>>>>>>> > >> >    root manifest; in that model, the amount of work needed to check
>>>>>>>>>> > >> >    for equality deletes is bounded by the root size. I’d keep that
>>>>>>>>>> > >> >    as a separate open question because there are other challenges
>>>>>>>>>> > >> >    with requiring equality deletes to only appear in the root
>>>>>>>>>> > >> >    manifest, especially on the upgrade path.
>>>>>>>>>> > >> >    2. After an equality delete, subsequent updates must produce a
>>>>>>>>>> > >> >    DV. As Xiening highlighted, once you’ve had an equality delete
>>>>>>>>>> > >> >    on a column, any subsequent updates on that column would be
>>>>>>>>>> > >> >    required to produce a DV (or positional delete) for the deleted
>>>>>>>>>> > >> >    positions at the new sequence number, making the original
>>>>>>>>>> > >> >    equality delete obsolete. This is attractive because it’s not
>>>>>>>>>> > >> >    too constraining for writers: they’re already doing the work of
>>>>>>>>>> > >> >    reconciling deleted positions to decide what to write into the
>>>>>>>>>> > >> >    column file, so the additional work is basically emitting the
>>>>>>>>>> > >> >    DV. The main thing to think through is how exactly the plumbing
>>>>>>>>>> > >> >    to engines looks, but in theory it’s just a matter of plumbing
>>>>>>>>>> > >> >    through explicitly deleted positions (or, less ideally,
>>>>>>>>>> > >> >    inferring them from a sentinel value in the tuple).
>>>>>>>>>> > >> >
>>>>>>>>>> > >> >
>>>>>>>>>> > >> > So far I’m leaning towards option 2, but we should develop some
>>>>>>>>>> > >> > concreteness around how feasible it is for engines to produce the
>>>>>>>>>> > >> > DVs on the column update. Again, should all be theoretically
>>>>>>>>>> > >> > possible based off plumbing deleted positions; we shouldn't let
>>>>>>>>>> > >> > implementations drive the spec but I think sniff testing the
>>>>>>>>>> > >> > practicality of it is well worth it to make sure that restriction
>>>>>>>>>> > >> > is reasonably implementable.
>>>>>>>>>> > >> >
>>>>>>>>>> > >> > Interested in hearing what others think about this one.
>>>>>>>>>> > >> >
>>>>>>>>>> > >> >
>>>>>>>>>> > >> > Thanks,
>>>>>>>>>> > >> >
>>>>>>>>>> > >> > Amogh Jahagirdar
>>>>>>>>>> > >> >
>>>>>>>>>> > >>
>>>>>>>>>> > >
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>
