> because column files are short-lived. Compaction rewrites them back into the base files regularly, so there is no long-lived dense corpus to migrate.
I don't necessarily agree that column files are short-lived. There is a discussion about column families that has been punted for later. Separate column files for column families can be a desired and optimal long-term state. I would also favor starting with the dense representation (filler values for deleted rows).

On Thu, Jun 25, 2026 at 2:58 PM Andrei Tserakhau via dev <[email protected]> wrote:

> +1 on picking dense as the single representation now, rather than leaving it up to the engine.
>
> The reason I'd mandate it, not just allow it, is the asymmetry. Dense is the special case of sparse, so mandating the special case is the smallest thing every reader has to implement: a positional substitution, no scatter, no merge-on-read stacks. And it covers the dominant workload directly - refreshing a whole column (a new column from an expression, or overwriting an existing column with new values like embeddings or model weights) is full-coverage by nature, not a point update to a few rows.
>
> Key thing here is that going dense now does not close the door on sparse.
>
> A sparse-capable reader is a superset of a dense one - it can read dense files too, since full coverage is just sparse with every position present. So adding sparse later is an additive format version: it widens the reader, it does not break existing dense files, and tables that never need sparse never pay for it. The reverse is not true. If we allow sparse now, every engine and client has to implement the harder merge-on-read path from day one, for row-level partial updates the current workloads are not asking for.
>
> On whether dense is a one-way door: I don't think it is, because column files are short-lived. Compaction rewrites them back into the base files regularly, so there is no long-lived dense corpus to migrate.
> If we add sparse later, old dense files age out through normal compaction, or we rewrite them - and because they are transient, that cost is bounded and amortized, not a table-wide migration.
>
> So my preference is to specify dense now and keep sparse as a documented future extension with its own format version, rather than leaving the representation unspecified. Leaving it open is the worst of the three: as Peter pointed out, it forces every reader to support sparse anyway, which is the exact cost we would be trying to defer.
>
> Best,
> Andrei
>
> On Thu, Jun 25, 2026 at 7:58 PM Steven Wu <[email protected]> wrote:
>
>> > We can, and in the PoC this is what we do, broadcast the "location -> record count" mapping to the writers for this.
>>
>> I am wondering if column file generation usually needs to scan the existing base files (or column files) anyway. Otherwise, a default value column (with expressions) should probably be sufficient. So the writer probably already has the data file metadata.
>>
>> Plus, carrying over additional contextual information (like manifest file location and entry position) is very beneficial, as the snapshot producer can generate manifest DVs efficiently without scanning manifest files to locate the old TrackedFile entry to delete (maybe via manifest DV).
>>
>> > Alternatively, when scanning inputs for the writers, we can also query the '_deleted' metadata column.
>>
>> I agree; this is another nice way to solve this problem assuming the base file or older column files need to be scanned.
>>
>> On Wed, Jun 24, 2026 at 11:15 PM Gábor Kaszab <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I share Steven's opinion that the cons against the dense representation aren't that strong, and the implementation seems more straightforward across projects and languages, if we keep an invariant to have all the rows (even the deleted ones with auxiliary values) in the column file.
>>>
>>> 1) Off stats
>>> I can just +1 on the stats part. They can be "fixed" so that the filler values do not throw them off, but the stats are already off anyway due to deletes, so I am not sure this is something we want to fix.
>>>
>>> 2) More data due to filler values
>>> TLDR: there is no significant difference between sparse and dense in storage size.
>>>
>>> The reason is the compression efficiency for the _pos column. I made some experiments on this front and encodings can help the dense representation. It is true that the more rows we delete, the more auxiliary values we have to use with the dense representation. On the other hand, the more rows we delete, the worse the compression of the _pos column is for the sparse representation (assuming Parquet V2) due to holes in the sequence. The overhead of the missing positions for sparse seems to balance out the overhead of the presence of auxiliary values for dense.
>>>
>>> 3) We have to know the record count of the base files in the writer
>>> I don't think this information is available in the writer today. We can, and in the PoC this is what we do, broadcast the "location -> record count" mapping to the writers for this.
>>>
>>> Alternatively, when scanning inputs for the writers, we can also query the '_deleted' metadata column. Using that we don't even have to broadcast the record counts.
>>>
>>> Summary:
>>> I think none of the cons for dense are deal breakers and I'm in favor of supporting a single representation. My preference is dense.
>>>
>>> Best Regards,
>>> Gabor
>>>
>>> On Wed, Jun 24, 2026 at 22:59, Steven Wu <[email protected]> wrote:
>>>
>>>> I agree with Peter's points here. While it seems flexible to have both options, it essentially requires every engine/client to implement the more complex read of the sparse representation.
>>>>
>>>> I want to revisit the cons that Anurag summarized for option 1 (filler values for deleted rows). To me, those arguments against filler values seem relatively weak, and the pros (zero-copy stitching, simpler reader implementation) outweigh the cons.
>>>>
>>>> > Filler values at deleted positions skew Parquet footer statistics (null_count, avg_length)
>>>>
>>>> Writers can produce accurate statistics in the Iceberg metadata even with filler values. I know the Java reference implementation currently just takes the column stats from the Parquet writer. But some writer implementations may choose to produce accurate stats in this case.
>>>>
>>>> There was also a concern that differing column statistics between Iceberg metadata and the Parquet footer, caused by DVs, could be confusing. I want to argue that this difference is actually reasonable. DV is a table-level concept. With DVs, Iceberg metadata can have different and adjusted column stats compared to the Parquet footer. Parquet is not aware of DVs, and the Parquet footer only captures the stats for the content in the physical file.
>>>>
>>>> Today, we already have inaccurate stats with DVs. It is not a correctness problem; it may have a small performance impact on pruning. Even if writer implementations do nothing special for column files, we are no worse off than today.
>>>>
>>>> > Writes slightly more data than necessary (filler values for deleted rows)
>>>>
>>>> This depends on the percentage of deleted rows. Sparse representation also has some small overhead for storing the encoded positions (even with delta encoding).
>>>>
>>>> > Writer must know base_file.record_count to pad trailing deletions (base file metadata already available during write planning)
>>>>
>>>> As already pointed out, the base file metadata already has the row count, so it is not really a problem.
>>>>
>>>> On Tue, Jun 23, 2026 at 9:05 AM Péter Váry <[email protected]> wrote:
>>>>
>>>>> I still have concerns with this decision:
>>>>> > Implementation Details: Specific writer implementation details such as choosing between dense or sparse representations will be left to individual engines.
>>>>> > Specification Scope: The specification will not mandate these internal implementation choices, provided that engines adhere to writing the explicit *_pos* column.
>>>>>
>>>>> If we do not specify whether the representation should be dense or sparse, we are effectively requiring all engines to support the sparse representation, since the dense representation is just a special case of the sparse one.
>>>>> In practice, this means every implementation must be able to materialize a dense representation from the sparse form, similar to what the current Spark implementation does today. While this is certainly feasible, it introduces an additional step on the read path, which is often performance-sensitive. This concern has been raised consistently by representatives of other Iceberg implementations, and I have not heard a different perspective from them so far.
>>>>>
>>>>> That said, if the broader group is comfortable accepting this trade-off, I do not have any further objections to the proposal.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Tue, Jun 16, 2026 at 20:51, Anurag Mantripragada <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> It seems this thread has become conflated with the metadata representation discussion <https://lists.apache.org/thread/7jryw9dfvc02s411twn4o7s5gjrybfxg>. While all the points raised here are noted, let’s continue those specific parts of the conversation in the metadata thread.
>>>>>>
>>>>>> Regarding data representation, we discussed the following during this <https://www.youtube.com/watch?v=kuxFBm-j5hw&t=3s> sync:
>>>>>>
>>>>>> - Implementation Details: Specific writer implementation details such as choosing between dense or sparse representations will be left to individual engines.
>>>>>> - Specification Scope: The specification will not mandate these internal implementation choices, provided that engines adhere to writing the explicit *_pos* column.
>>>>>>
>>>>>> Please let me know if you have concerns.
>>>>>>
>>>>>> ~ Anurag
>>>>>>
>>>>>> On Tue, Jun 2, 2026 at 11:44 AM Xiening Dai <[email protected]> wrote:
>>>>>>
>>>>>>> We also need to think about the DV-only case.
>>>>>>>
>>>>>>> If we have f0 with dv0, then we do a column update and generate f1. Do we also bump the sequence number for f0 in this case? There are multiple options:
>>>>>>>
>>>>>>> 1) We bump the sequence number, then we will need to copy dv0 into dv1 and assign the same sequence number to dv1 so that the delete positions won't get lost.
>>>>>>> 2) We don't bump the sequence number, then we don't need to rewrite dv0 and everything would remain working. But this creates a small inconsistency with the eq-delete case, and requires special-case handling on the write path.
>>>>>>> 3) We bump the sequence number for both the data file f0 and dv0. We don't need to rewrite the DV, but instead we bump the sequence number for the DV as well.
>>>>>>>
>>>>>>> I'd suggest we write down these details into a spec change proposal and examine the read/write workflow carefully.
>>>>>>>
>>>>>>> On 2026/06/02 12:42:10 Gábor Kaszab wrote:
>>>>>>> > Thanks for the summary, Amogh!
>>>>>>> >
>>>>>>> > I think the missing building block to make this eq-delete rewrite work is the decision made yesterday, to bump the base file-level sequence number when adding a column file. With this, we can make sure that after we have rewritten the eq-deletes into DVs in the process of adding column files, we don't have to apply the eq-deletes we had previously on the base file.
>>>>>>> >
>>>>>>> > Just some thoughts on implementation:
>>>>>>> >
>>>>>>> > - Write path in general: When writing the update file, we designed this in the PoC to receive _path and _pos from the base file. With this we can identify if some positions are missing and we can convert them into DVs.
>>>>>>> > - Trailing deletes: The tricky part is when trailing rows are deleted. I see 2 approaches to get around this:
>>>>>>> >    - Broadcast base file row counts to writers (this is done by the PoC): When we have received the last row from the base file with pos X, but we know there are more rows in the base file, we have to add the trailing positions to the DV.
>>>>>>> >    - Enrich the input rows fed to the writer with the "_deleted" metadata column. False => write to update file, true => write pos to DV.
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > Gabor
>>>>>>> >
>>>>>>> > On Mon, Jun 1, 2026 at 22:48, Amogh Jahagirdar <[email protected]> wrote:
>>>>>>> >
>>>>>>> > > > The real challenge comes from the read path. In the case when we have a data file f0, an equality delete file d0, and column file f1, and the materialized dv d1. How do we reconcile the deletes during read?
>>>>>>> > > > If we don't do anything special, following the existing spec (based on sequence number rule), we would apply d0 on f0, and then apply d1 on f1, which should still give us the correct results as both d0 and d1 represent the same set of positions. But this is undesired because we don't want to load and re-evaluate the old column values. So we need a change in the spec so that in this scenario the new d1 supersedes the existing equality delete file (d0).
>>>>>>> > >
>>>>>>> > > So given the following invariants/rules:
>>>>>>> > >
>>>>>>> > > 1. In a dense representation, column updates must carry over all active values for the column (and there's a _pos column referencing the position from the original base file).
>>>>>>> > > 2. Column updates must know what rows were deleted (either to omit the row or materialize the default value).
>>>>>>> > > 3. Data sequence numbers are updated on column appends/updates (this would be a spec change in v4). I think reusing the same seq. number is key since we don't have a different sequence number definition that's temporal in dimension for delete matching and another one that's not temporal but for column updates. Having a single sequence number simplifies a lot of this.
>>>>>>> > > 4. The requirement that a column update must also rewrite existing equality deletes into DV.
>>>>>>> > >
>>>>>>> > > I think this combination (and the fact that DVs are 1:1 with data files) naturally addresses this because f1 in this example would have the column values for all the active rows. Then the DV v1 just deletes row positions as usual.
>>>>>>> > > There's never a need to actually read the old column values in this model.
>>>>>>> > >
>>>>>>> > > There's a broader discussion around eliminating new equality deletes in v4 but in that case this rule would still apply to handle older equality deletes from v3 and earlier + column updates on older data files as well.
>>>>>>> > >
>>>>>>> > > We actually talked about this a bit in today's v4 AMT sync <https://youtu.be/7mVes-6pM1c?t=861>
>>>>>>> > >
>>>>>>> > > Thanks,
>>>>>>> > > Amogh Jahagirdar
>>>>>>> > >
>>>>>>> > > On Mon, Jun 1, 2026 at 12:17 PM Xiening Dai <[email protected]> wrote:
>>>>>>> > >
>>>>>>> > >> > but we should develop some concreteness around how feasible it is for engines to produce the DVs on the column update.
>>>>>>> > >>
>>>>>>> > >> Actually I don't think this would be a problem. As mentioned, in order to generate a correct column file, we already need to produce the correct set of deleted positions, and we just need an extra step to materialize these positions into a DV.
>>>>>>> > >>
>>>>>>> > >> The real challenge comes from the read path. In the case when we have a data file f0, an equality delete file d0, and column file f1, and the materialized dv d1. How do we reconcile the deletes during read? If we don't do anything special, following the existing spec (based on sequence number rule), we would apply d0 on f0, and then apply d1 on f1, which should still give us the correct results as both d0 and d1 represent the same set of positions. But this is undesired because we don't want to load and re-evaluate the old column values.
>>>>>>> > >> So we need a change in the spec so that in this scenario the new d1 supersedes the existing equality delete file (d0).
>>>>>>> > >>
>>>>>>> > >> On 2026/05/29 23:21:33 Amogh Jahagirdar wrote:
>>>>>>> > >> > One approach that’s helped me reason about all this is to treat each base file as its own little mini-table inside the larger table: the row range of the base file keyed by row_id, and column files/deletes just layer on top. Once a row is deleted in that mini-table, it stays deleted in that mini-table’s state (whether that’s via equality deletes, or DVs), and column updates are just layering changed or additional columns on top of whatever rows are still there. Then I can reason about "what are desirable properties of this mini-table".
>>>>>>> > >> >
>>>>>>> > >> > Once I look at it that way, stacking equality deletes with column updates on the same column, and then forcing the write path to read all the older column files when producing new column updates, feels like the worst outcome; and it gets worse the more column updates there are for the column. It blows up complexity and performance and compromises the value of efficient column updates.
>>>>>>> > >> >
>>>>>>> > >> > If we eliminate that option, I think we’re left with two high-level approaches:
>>>>>>> > >> >
>>>>>>> > >> > 1. Equality deletes cannot be allowed with column updates. This simplifies both the read and write paths when column update files are present.
>>>>>>> > >> >    I would generally prefer this option, but there is a legitimate problem around the “how” for checking for the presence of equality deletes. We can’t rely on snapshot summaries, which means we’d have to look at delete manifests to really know if equality deletes exist. There were ideas in the V4 AMT sync about constraining equality deletes to be in the root manifest; in that model, the amount of work needed to check for equality deletes is bounded by the root size. I’d keep that as a separate open question because there are other challenges with requiring equality deletes to only appear in the root manifest, especially on the upgrade path.
>>>>>>> > >> > 2. After an equality delete, subsequent updates must produce a DV. As Xiening highlighted, once you’ve had an equality delete on a column, any subsequent updates on that column would be required to produce a DV (or positional delete) for the deleted positions at the new sequence number, making the original equality delete obsolete. This is attractive because it’s not too constraining for writers: they’re already doing the work of reconciling deleted positions to decide what to write into the column file, so the additional work is basically emitting the DV.
>>>>>>> > >> >    The main thing to think through is how exactly the plumbing to engines looks, but in theory it’s just a matter of plumbing through explicitly deleted positions (or, less ideally, inferring them from a sentinel value in the tuple).
>>>>>>> > >> >
>>>>>>> > >> > So far I’m leaning towards option 2, but we should develop some concreteness around how feasible it is for engines to produce the DVs on the column update. Again, should all be theoretically possible based off plumbing deleted positions; we shouldn't let implementations drive the spec but I think sniff testing the practicality of it is well worth it to make sure that restriction is reasonably implementable.
>>>>>>> > >> >
>>>>>>> > >> > Interested in hearing what others think about this one.
>>>>>>> > >> >
>>>>>>> > >> > Thanks,
>>>>>>> > >> >
>>>>>>> > >> > Amogh Jahagirdar
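
To make the dense option discussed in this thread concrete, here is a minimal Python sketch of the write and read paths, under stated assumptions. It is not Iceberg API and not the PoC code: the function names, the in-memory lists standing in for files, and the `None` filler are all hypothetical. The writer receives only the live (pos, value) rows plus the broadcast base file record count, pads missing positions (including trailing deletes) with a filler, and reports those positions so they can be materialized as a DV. The reader then does a plain positional substitution with no scatter or merge step.

```python
from typing import Any, Dict, Iterable, List, Tuple

FILLER = None  # hypothetical filler value written at deleted positions


def write_dense_column(
    record_count: int,
    live_rows: Iterable[Tuple[int, Any]],
    filler: Any = FILLER,
) -> Tuple[List[Any], List[int]]:
    """Build a dense column (one value per base-file row) from live rows.

    live_rows must be (pos, value) pairs sorted by pos, deleted rows omitted.
    record_count is the base file row count, assumed to be broadcast to the
    writer. Returns (dense_values, deleted_positions).
    """
    values: List[Any] = []
    deleted: List[int] = []
    next_pos = 0
    for pos, value in live_rows:
        if pos < next_pos or pos >= record_count:
            raise ValueError(f"unexpected position {pos}")
        # Any gap before this row means those positions were deleted.
        for missing in range(next_pos, pos):
            values.append(filler)
            deleted.append(missing)
        values.append(value)
        next_pos = pos + 1
    # Trailing deletes: only detectable because record_count is known.
    for missing in range(next_pos, record_count):
        values.append(filler)
        deleted.append(missing)
    return values, deleted


def read_dense_column(
    base_rows: List[Dict[str, Any]],
    column_name: str,
    column_values: List[Any],
    deleted_positions: Iterable[int],
) -> List[Dict[str, Any]]:
    """Stitch a dense column onto base rows by position, then apply the DV."""
    if len(column_values) != len(base_rows):
        raise ValueError("dense column must cover every base-file position")
    dv = set(deleted_positions)
    return [
        {**row, column_name: column_values[pos]}
        for pos, row in enumerate(base_rows)
        if pos not in dv
    ]


if __name__ == "__main__":
    # Base file has 6 rows; positions 1, 4 and 5 were deleted upstream.
    dense, dv = write_dense_column(6, [(0, "a"), (2, "c"), (3, "d")])
    base = [{"id": i} for i in range(6)]
    print(dense, dv)
    print(read_dense_column(base, "emb", dense, dv))
```

The same writer could be driven by the '_deleted' metadata column instead of the record count: in that variant every position arrives at the writer, and the flag decides whether a row contributes a real value or a filler plus a DV entry.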
