> There is a discussion about column families that has been punted for later. Separate column files for column families can be a desired and optimal long-term state.

Agree here.

With column families, separate column files become an intentional long-term layout, not a transient overlay that compaction folds away. I'd treat column families as a separate effort, and I think it's the effort that actually motivates sparse.

The column-update use case is whole-column refresh - that's clearly dense. The regime where sparse earns its place is long-lived files updated repeatedly in part, because full-coverage rewrites get wasteful there - and that regime is exactly column families.

So the representation can track the feature instead of being decided up front for both:

- column updates: whole-column refresh -> dense.
- column families: persistent separate files, repeated partial updates -> where sparse earns its place, decided as part of that design.

That keeps today's reader simple (positional, dense only), and it gives sparse a concrete trigger - it comes in with column families, as its own format update - instead of an open-ended "maybe later" that forces every reader to support it now.

So still +1 on starting dense. I'd just frame sparse as part of the column families effort when we pick that back up, rather than a representation choice we have to settle today.

Best,
Andrei

On Fri, Jun 26, 2026 at 12:08 AM Steven Wu <[email protected]> wrote:

> > because column files are short-lived. Compaction rewrites them back into the base files regularly, so there is no long-lived dense corpus to migrate.
>
> I don't necessarily agree that column files are short-lived. There is a discussion about column families that has been punted for later. Separate column files for column families can be a desired and optimal long-term state.
>
> I would also favor starting with the dense representation (filler values for deleted rows).
>
> On Thu, Jun 25, 2026 at 2:58 PM Andrei Tserakhau via dev <[email protected]> wrote:
>
>> +1 on picking dense as the single representation now, rather than leaving it up to the engine.
>>
>> The reason I'd mandate it, not just allow it, is the asymmetry. Dense is the special case of sparse, so mandating the special case is the smallest thing every reader has to implement: a positional substitution, no scatter, no merge-on-read stacks. And it covers the dominant workload directly - refreshing a whole column (a new column from an expression, or overwriting an existing column with new values like embeddings or model weights) is full-coverage by nature, not a point update to a few rows.
>>
>> Key thing here is that going dense now does not close the door on sparse.
>>
>> A sparse-capable reader is a superset of a dense one - it can read dense files too, since full coverage is just sparse with every position present. So adding sparse later is an additive format version: it widens the reader, it does not break existing dense files, and tables that never need sparse never pay for it. The reverse is not true. If we allow sparse now, every engine and client has to implement the harder merge-on-read path from day one, for row-level partial updates the current workloads are not asking for.
>>
>> On whether dense is a one-way door: I don't think it is, because column files are short-lived. Compaction rewrites them back into the base files regularly, so there is no long-lived dense corpus to migrate. If we add sparse later, old dense files age out through normal compaction, or we rewrite them - and because they are transient, that cost is bounded and amortized, not a table-wide migration.
>>
>> So my preference is to specify dense now and keep sparse as a documented future extension with its own format version, rather than leaving the representation unspecified. Leaving it open is the worst of the three: as Peter pointed out, it forces every reader to support sparse anyway, which is the exact cost we would be trying to defer.
>>
>> Best,
>> Andrei
>>
>> On Thu, Jun 25, 2026 at 7:58 PM Steven Wu <[email protected]> wrote:
>>
>>> > We can, and in the PoC this is what we do, broadcast the "location -> record count" mapping to the writers for this.
>>>
>>> I am wondering if column file generation usually needs to scan the existing base files (or column files) anyway. Otherwise, a default value column (with expressions) should probably be sufficient. So the writer probably already has the data file metadata.
>>>
>>> Plus, carrying over additional contextual information (like manifest file location and entry position) is very beneficial, as the snapshot producer can generate manifest DVs efficiently without scanning manifest files to locate the old TrackedFile entry to delete (maybe via manifest DV).
>>>
>>> > Alternatively, when scanning inputs for the writers, we can also query the '_deleted' metadata column.
>>>
>>> I agree; this is another nice way to solve this problem assuming the base file or older column files need to be scanned.
>>>
>>> On Wed, Jun 24, 2026 at 11:15 PM Gábor Kaszab <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I share Steven's opinion that the cons against the dense representation aren't that strong, and the implementation seems more straightforward across projects and languages, if we keep an invariant to have all the rows (even the deleted ones with auxiliary values) in the column file.
>>>>
>>>> 1) Off stats
>>>> I can just +1 on the stats part. They can be "fixed" to not go off caused by the filler values, but the stats are off already anyway due to deletes, so not sure if this is something we want to fix.
>>>>
>>>> 2) More data due to filler values
>>>> TLDR: there is no significant difference between sparse and dense in storage size
>>>>
>>>> The reason is the compression efficiency for the _pos column.
>>>> I made some experiments on this front and encodings can help the dense representation. The more rows we delete, the more auxiliary values we have to use with the dense representation, this is true. On the other hand, the more rows we delete the worse the compression of the _pos column is for the sparse representation (assuming Parquet V2) due to holes in the sequence.
>>>> The overhead of the missing positions for sparse seems to balance out the overhead of the presence of auxiliary values for dense.
>>>>
>>>> 3) We have to know the record count of the base files in the writer
>>>> I don't think this information is available in the writer now. We can, and in the PoC this is what we do, broadcast the "location -> record count" mapping to the writers for this.
>>>>
>>>> Alternatively, when scanning inputs for the writers, we can also query the '_deleted' metadata column. Using that we don't even have to broadcast the record counts.
>>>>
>>>> Summary:
>>>> I think none of the cons for dense are deal breakers and I'm in favor of supporting a single representation. My preference is dense.
>>>>
>>>> Best Regards,
>>>> Gabor
>>>>
>>>> Steven Wu <[email protected]> wrote (on Wed, Jun 24, 2026, 22:59):
>>>>
>>>>> I agree with Peter's points here. While it seems flexible to have both options, it essentially requires every engine/client to implement the more complex read of sparse representation.
>>>>>
>>>>> I want to revisit the cons that Anurag summarized for the option 1 (filler values for deleted rows). To me, those arguments against filler values seem relatively weak, and the pros (zero-copy stitching, simpler reader implementation) outweigh the cons.
>>>>>
>>>>> > Filler values at deleted positions skew Parquet footer statistics (null_count, avg_length)
>>>>>
>>>>> Writers can produce accurate statistics in the Iceberg metadata even with filler values. I know the Java reference implementation currently just takes the column stats from the Parquet writer. But some writer implementations may choose to produce accurate stats in this case.
>>>>>
>>>>> There was also a concern that differing column statistics between Iceberg metadata and the Parquet footer, caused by DVs, could be confusing. I want to argue that this difference is actually reasonable. DV is a table level concept. With DVs, Iceberg metadata can have different and adjusted column stats compared to the Parquet footer. Parquet is not aware of DVs, and the Parquet footer only captures the stats for the content in the physical file.
>>>>>
>>>>> Today, we already have inaccurate stats with DVs. It is not a correctness problem; it may have a small performance impact on pruning. Even if writer implementations do nothing special for column files, we are no worse off than today.
>>>>>
>>>>> > Writes slightly more data than necessary (filler values for deleted rows)
>>>>>
>>>>> This depends on the percentage of deleted rows. Sparse representation also has some small overhead for storing the encoded positions (even with delta encoding).
>>>>>
>>>>> > Writer must know base_file.record_count to pad trailing deletions (base file metadata already available during write planning)
>>>>>
>>>>> As already pointed out, the base file metadata already has the row count.
>>>>> So it is not really a problem.
>>>>>
>>>>> On Tue, Jun 23, 2026 at 9:05 AM Péter Váry <[email protected]> wrote:
>>>>>
>>>>>> I still have concerns with this decision:
>>>>>> > Implementation Details: Specific writer implementation details such as choosing between dense or sparse representations will be left to individual engines.
>>>>>> > Specification Scope: The specification will not mandate these internal implementation choices, provided that engines adhere to writing the explicit *_pos* column.
>>>>>>
>>>>>> If we do not specify whether the representation should be dense or sparse, we are effectively requiring all engines to support the sparse representation, since the dense representation is just a special case of the sparse one.
>>>>>> In practice, this means every implementation must be able to materialize a dense representation from the sparse form, similar to what the current Spark implementation does today. While this is certainly feasible, it introduces an additional step on the read path, which is often performance-sensitive. This concern has been raised consistently by representatives of other Iceberg implementations, and I have not heard a different perspective from them so far.
>>>>>>
>>>>>> That said, if the broader group is comfortable accepting this trade-off, I do not have any further objections to the proposal.
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> Anurag Mantripragada <[email protected]> wrote (on Tue, Jun 16, 2026, 20:51):
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> It seems this thread has become conflated with the metadata representation discussion <https://lists.apache.org/thread/7jryw9dfvc02s411twn4o7s5gjrybfxg>. While all the points raised here are noted, let’s continue those specific parts of the conversation in the metadata thread.
>>>>>>>
>>>>>>> Regarding data representation, we discussed the following during this <https://www.youtube.com/watch?v=kuxFBm-j5hw&t=3s> sync:
>>>>>>>
>>>>>>> - Implementation Details: Specific writer implementation details such as choosing between dense or sparse representations will be left to individual engines.
>>>>>>> - Specification Scope: The specification will not mandate these internal implementation choices, provided that engines adhere to writing the explicit *_pos* column.
>>>>>>>
>>>>>>> Please let me know if you have concerns.
>>>>>>>
>>>>>>> ~ Anurag
>>>>>>>
>>>>>>> On Tue, Jun 2, 2026 at 11:44 AM Xiening Dai <[email protected]> wrote:
>>>>>>>
>>>>>>>> We also need to think about the DV only case.
>>>>>>>>
>>>>>>>> If we have f0 with dv0, then we do column update and generate f1. Do we also bump the sequence number for f0 in this case? There are multiple options:
>>>>>>>>
>>>>>>>> 1) We bump the sequence number, then we will need to copy dv0 into dv1 and assign the same sequence number to dv1 so that the delete positions won't get lost.
>>>>>>>> 2) We don't bump the sequence number, then we don't need to re-write dv0 and everything would remain working. But this creates a small inconsistency with eq delete case, and requires a special case handling at write path.
>>>>>>>> 3) We bump sequence number for both data file f0, and dv0. We don't need to rewrite dv, but instead we bump the sequence number for the dv as well.
>>>>>>>>
>>>>>>>> I'd suggest we write down these details into a spec change proposal and examine the read/write workflow carefully.
>>>>>>>>
>>>>>>>> On 2026/06/02 12:42:10 Gábor Kaszab wrote:
>>>>>>>> > Thanks for the summary, Amogh!
>>>>>>>> >
>>>>>>>> > I think the missing building block to make this eq-delete rewrite work is the decision made yesterday, to bump the base file-level sequence number when adding a column file. With this, we can make sure that after we have rewritten the eq-deletes into DVs in the process of adding column files, we don't have to apply the eq-deletes we had previously on the base file.
>>>>>>>> >
>>>>>>>> > Just some thoughts on implementation:
>>>>>>>> >
>>>>>>>> > - Write path in general: When writing the update file, we designed this in the PoC to receive _path and _pos from the base file. With this we can identify if some positions are missing and we can convert them into DVs
>>>>>>>> > - Trailing deletes: The tricky part is when trailing rows are deleted. I see 2 approaches to get around this:
>>>>>>>> >   - Broadcast base file row counts to writers (this is done by the PoC): When we received the last row from the base file with pos X, but we know there are more rows in the base file, we have to add the trailing positions to the DV
>>>>>>>> >   - Enrich the input rows fed to the writer with the "_deleted" metadata column. False => write to update file, true => write pos to DV
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > Gabor
>>>>>>>> >
>>>>>>>> > Amogh Jahagirdar <[email protected]> wrote (on Mon, Jun 1, 2026, 22:48):
>>>>>>>> >
>>>>>>>> > > > The real challenge comes from the read path. In the case when we have a data file f0, an equality delete file d0, and column file f1, and the materialized dv d1. How do we reconcile the deletes during read?
>>>>>>>> > > > If we don't do anything special, following the existing spec (based on sequence number rule), we would apply d0 on f0, and then apply d1 on f1, which should still give us the correct results as both d0 and d1 represent the same set of positions. But this is undesired because we don't want to load and re-evaluate the old column values. So we need a change in the spec so that in this scenario the new d1 supersedes the existing equality delete file (d0).
>>>>>>>> > >
>>>>>>>> > > So given the following invariants/rules:
>>>>>>>> > >
>>>>>>>> > > 1. In a dense representation, column updates must carry over all active values for the column (and there's a _pos column referencing the position from the original base file).
>>>>>>>> > > 2. Column updates must know what rows were deleted (either to omit the row or materialize the default value)
>>>>>>>> > > 3. Data sequence numbers are updated on column appends/updates (this would be a spec change in v4). I think reusing the same seq. number is key since we don't have a different sequence number definition that's temporal in dimension for delete matching and another one that's not temporal but for column updates. Having a single sequence number simplifies a lot of this.
>>>>>>>> > > 4. The requirement that a column update must also rewrite existing equality deletes into DV
>>>>>>>> > >
>>>>>>>> > > I think this combination (and the fact that DVs are 1:1 with data files) naturally addresses this because f1 in this example would have the column values for all the active rows. Then the DV d1 just deletes row positions as usual.
>>>>>>>> > > There's never a need to actually read the old column values in this model.
>>>>>>>> > >
>>>>>>>> > > There's a broader discussion around eliminating new equality deletes in v4 but in that case this rule would still apply to handle older equality deletes from v3 and earlier + column updates on older data files as well.
>>>>>>>> > >
>>>>>>>> > > We actually talked about this a bit in today's V4 AMT sync <https://youtu.be/7mVes-6pM1c?t=861>
>>>>>>>> > >
>>>>>>>> > > Thanks,
>>>>>>>> > > Amogh Jahagirdar
>>>>>>>> > >
>>>>>>>> > > On Mon, Jun 1, 2026 at 12:17 PM Xiening Dai <[email protected]> wrote:
>>>>>>>> > >
>>>>>>>> > >> > but we should develop some concreteness around how feasible it is for engines to produce the DVs on the column update.
>>>>>>>> > >>
>>>>>>>> > >> Actually I don't think this would be a problem. As mentioned, in order to generate correct column file, we already need to produce the correct set of deleted positions, and we just need an extra step to materialize these positions into DV.
>>>>>>>> > >>
>>>>>>>> > >> The real challenge comes from the read path. In the case when we have a data file f0, an equality delete file d0, and column file f1, and the materialized dv d1. How do we reconcile the deletes during read? If we don't do anything special, following the existing spec (based on sequence number rule), we would apply d0 on f0, and then apply d1 on f1, which should still give us the correct results as both d0 and d1 represent the same set of positions. But this is undesired because we don't want to load and re-evaluate the old column values.
>>>>>>>> > >> So we need a change in the spec so that in this scenario the new d1 supersedes the existing equality delete file (d0).
>>>>>>>> > >>
>>>>>>>> > >> On 2026/05/29 23:21:33 Amogh Jahagirdar wrote:
>>>>>>>> > >> > One approach that’s helped me reason about all this is to treat each base file as its own little mini‑table inside the larger table: the row range of the base file keyed by row_id, and column files/deletes just layer on top. Once a row is deleted in that mini‑table, it stays deleted in that mini‑table’s state (whether that’s via equality deletes, or DVs), and column updates are just layering changed or additional columns on top of whatever rows are still there. Then I can reason about "what are desirable properties of this mini-table".
>>>>>>>> > >> >
>>>>>>>> > >> > Once I look at it that way, stacking equality deletes with column updates on the same column, and then forcing the write path to read all the older column files when producing new column updates, feels like the worst outcome; and it gets worse the more column updates there are for the column. It blows up complexity and performance and compromises the value of efficient column updates.
>>>>>>>> > >> >
>>>>>>>> > >> > If we eliminate that option, I think we’re left with two high‑level approaches:
>>>>>>>> > >> >
>>>>>>>> > >> > 1. Equality deletes cannot be allowed with column updates. This simplifies both the read and write paths when column update files are present.
>>>>>>>> > >> >    I would generally prefer this option but there is a legitimate problem around the “how” for checking for the presence of equality deletes. We can’t rely on snapshot summaries, which means we’d have to look at delete manifests to really know if equality deletes exist. There were ideas in the V4 AMT sync about constraining equality deletes to be in the root manifest; in that model, the amount of work needed to check for equality deletes is bounded by the root size. I’d keep that as a separate open question because there are other challenges with requiring equality deletes to only appear in the root manifest, especially on the upgrade path.
>>>>>>>> > >> > 2. After an equality delete, subsequent updates must produce a DV. As Xiening highlighted, once you’ve had an equality delete on a column, any subsequent updates on that column would be required to produce a DV (or positional delete) for the deleted positions at the new sequence number, making the original equality delete obsolete. This is attractive because it’s not too constraining for writers: they’re already doing the work of reconciling deleted positions to decide what to write into the column file, so the additional work is basically emitting the DV.
>>>>>>>> > >> >    The main thing to think through is how exactly the plumbing to engines looks, but in theory it’s just a matter of plumbing through explicitly deleted positions (or, less ideally, inferring them from a sentinel value in the tuple).
>>>>>>>> > >> >
>>>>>>>> > >> > So far I’m leaning towards option 2, but we should develop some concreteness around how feasible it is for engines to produce the DVs on the column update. Again, should all be theoretically possible based off plumbing deleted positions; we shouldn't let implementations drive the spec but I think sniff testing the practicality of it is well worth it to make sure that restriction is reasonably implementable.
>>>>>>>> > >> >
>>>>>>>> > >> > Interested in hearing what others think about this one.
>>>>>>>> > >> >
>>>>>>>> > >> > Thanks,
>>>>>>>> > >> >
>>>>>>>> > >> > Amogh Jahagirdar
