Re: [Discuss] Column Update File Representation

Amogh Jahagirdar Fri, 29 May 2026 16:35:55 -0700

Sorry, "or a positional delete" should not have been mentioned in my point
2 above; it should just be a DV.


On Fri, May 29, 2026 at 5:21 PM Amogh Jahagirdar <[email protected]> wrote:

> One approach that’s helped me reason about all this is to treat each base
> file as its own little mini‑table inside the larger table: the row range of
> the base file keyed by row_id, and column files/deletes just layer on top. Once
> a row is deleted in that mini‑table, it stays deleted in that mini‑table’s
> state (whether that’s via equality deletes or DVs), and column updates are
> just layering changed or additional columns on top of whatever rows are
> still there. Then I can reason about "what are desirable properties of this
> mini-table".
>
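A minimal sketch of the mini-table mental model described above, in case it helps. This is an illustration only, not Iceberg code: the `MiniTable` class, its field names, and the dict-based "files" are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class MiniTable:
    """One base file viewed as a small table keyed by row_id (hypothetical model)."""

    # row_id -> column values stored in the base file
    base_rows: dict
    # row_ids deleted so far; once a row_id lands here it never comes back
    deleted: set = field(default_factory=set)
    # column update layers, oldest first; each maps row_id -> changed/added columns
    column_updates: list = field(default_factory=list)

    def delete(self, row_id):
        # A delete is sticky, regardless of how it was expressed on disk.
        self.deleted.add(row_id)

    def add_column_update(self, layer):
        # A column update only layers values; it never resurrects a row.
        self.column_updates.append(layer)

    def read(self):
        result = {}
        for row_id, row in self.base_rows.items():
            if row_id in self.deleted:
                continue  # deleted rows stay deleted, whatever layers say
            merged = dict(row)
            for layer in self.column_updates:
                merged.update(layer.get(row_id, {}))  # newer layers win
            result[row_id] = merged
        return result
```

For example, deleting row 1 and then writing a later column update that still mentions row 1 leaves it deleted on read, which is the property the model is meant to capture.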
> Once I look at it that way, stacking equality deletes with column updates
> on the same column, and then forcing the write path to read all the older
> column files when producing new column updates, feels like the worst
> outcome; and it gets worse the more column updates there are for the
> column. It blows up complexity and performance and compromises the value of
> efficient column updates.
>
> If we eliminate that option, I think we’re left with two high‑level
> approaches:
>
>    1. Equality deletes cannot be allowed with column updates. This
>    simplifies both the read and write paths when column update files are
>    present. I would generally prefer this option but there is a legitimate
>    problem around “how” to check for the presence of equality deletes. We
>    can’t rely on snapshot summaries, which means we’d have to look at delete
>    manifests to really know if equality deletes exist. There were ideas in the
>    V4 AMT sync about constraining equality deletes to be in the root manifest;
>    in that model, the amount of work needed to check for equality deletes is
>    bounded by the root size. I’d keep that as a separate open question because
>    there are other challenges with requiring equality deletes to only appear
>    in the root manifest, especially on the upgrade path.
>    2. After an equality delete, subsequent updates must produce a DV. As
>    Xiening highlighted, once you’ve had an equality delete on a column, any
>    subsequent updates on that column would be required to produce a DV (or
>    positional delete) for the deleted positions at the new sequence number,
>    making the original equality delete obsolete. This is attractive because
>    it’s not too constraining for writers: they’re already doing the work of
>    reconciling deleted positions to decide what to write into the column file,
>    so the additional work is basically emitting the DV. The main thing to
>    think through is how exactly the plumbing to engines looks, but in theory
>    it’s just a matter of plumbing through explicitly deleted positions (or,
>    less ideally, inferring them from a sentinel value in the tuple).
>
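To make option 2 a bit more concrete, here is a rough sketch of what the writer side could look like. It rests on several assumptions: `write_column_update` and its parameters are hypothetical names, a plain Python set stands in for the DV, equality deletes are modeled as simple `{column: value}` predicates, and the writer is assumed to already have the reconciled row values in hand.

```python
def write_column_update(rows, equality_deletes, existing_dv, new_value):
    """Produce a column update plus the DV that supersedes the equality deletes.

    rows:             reconciled rows of one base file, in position order (assumed)
    equality_deletes: list of {column: value} predicates applying to this file
    existing_dv:      set of positions already deleted in this file
    new_value:        function computing the updated column value for a live row
    """
    dv = set(existing_dv)   # start from positions that are already deleted
    column_file = {}        # position -> new column value, live rows only

    for pos, row in enumerate(rows):
        if pos in dv:
            continue
        matches_delete = any(
            all(row.get(col) == val for col, val in pred.items())
            for pred in equality_deletes
        )
        if matches_delete:
            # The writer had to work this out anyway to decide what to write;
            # the extra step is recording the position explicitly.
            dv.add(pos)
            continue
        column_file[pos] = new_value(row)

    return column_file, dv
```

With both outputs committed at the new sequence number, a reader of this file would only need the DV and would not have to re-apply the older equality delete, which is the "obsolete" part of option 2.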
>
> So far I’m leaning towards option 2, but we should develop some
> concreteness around how feasible it is for engines to produce the DVs on
> the column update. Again, this should all be theoretically possible by
> plumbing through deleted positions. We shouldn't let implementations drive
> the spec, but I think sniff testing its practicality is well worth it to make
> sure that restriction is reasonably implementable.
>
> Interested in hearing what others think about this one.
>
>
> Thanks,
>
> Amogh Jahagirdar
>
>
>
