Steven, that may be a good point to add to ensure the metadata is properly maintained. If I remember correctly, the Spark implementation already drops old DVs in DELETE/UPDATE/MERGE but the data compaction wasn't doing it originally. I wonder if we fixed it. Eduard may know more.
- Anton

On Wed, May 7, 2025 at 16:29 Steven Wu <stevenz...@gmail.com> wrote:

> For the deletion vector change, should we add the following constraint/requirement for the write path in the spec? I don't know if this is already the behavior of the Spark implementation.
>
> "if a data file is removed from the table, the corresponding DV reference must also be removed from the delete manifest file"
>
> This constraint is to guarantee no orphaned DVs in the table state. It will be cheaper to calculate an *accurate* table row count: just iterate through the manifest files (data and delete), adding and subtracting record counts. There is no need to check whether the data files referenced by DVs are still part of the table, which can be a little more expensive.
>
> On Tue, May 6, 2025 at 9:18 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>
>> Thanks for the clarification, Ryan.
>>
>> I'm aware of the major changes, but I find it hard to go through all the related descriptions, which are scattered all over the place.
>>
>> Manu
>>
>> On Tue, May 6, 2025 at 11:24 PM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> Manu,
>>>
>>> We aren't currently voting. We are discussing any outstanding items to address before we close v3 to further changes and adopt the existing v3 changes. Right now, the open item is to clarify NaN behavior in geometry and geography, PR #12956 <https://github.com/apache/iceberg/pull/12956>.
>>>
>>> Thanks for noting that the row lineage changes should be added to the appendix, I'll open a PR to add it. That appendix is an area to highlight things that have changed across versions, but an omission does not alter the requirements elsewhere in the spec. The changes we are discussing are the things that are noted as part of v3 in the spec. The major additions are new types, DVs, and row lineage.
>>>
>>> Ryan
>>>
>>> On Tue, May 6, 2025 at 3:32 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>
>>>> I'm wondering what changes we are voting for here. Is it everything related to https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities from the table spec?
>>>> How about changes to other specs?
>>>>
>>>> Do we summarize all the changes in https://iceberg.apache.org/spec/#appendix-e-format-version-changes? It looks like row lineage is missing here.
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>> On Tue, May 6, 2025 at 12:09 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>>>
>>>>> DVs in Spark seem to behave reasonably, serving as a reference implementation of the V3 spec. There are areas for optimization/refinement but nothing was observed that requires changing the spec. I would also like to add the notion of content overhead/metadata (for Puffin/Parquet footers) to manifests to optimize DV maintenance. That said, it is optional information and can be added after finalizing V3.
>>>>>
>>>>> - Anton
>>>>>
>>>>> On Fri, May 2, 2025 at 23:23 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>
>>>>>> Hi Ryan
>>>>>>
>>>>>> All good for the spec. The idea for release is just a help to "double check" the spec is good (we already saw some slight changes to the spec while working on the release). I think we can be "confident" that we won't have unexpected changes.
>>>>>>
>>>>>> Thanks !
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>> >
>>>>>> > Thanks, everyone! Looks like there are a few points to discuss.
>>>>>> >
>>>>>> > [JB] Maybe a release with the core updated before announcing spec v3 officially would be a good idea ?
>>>>>> > [Manu] Agree with Russell and JB that we make a “RC” release for V3 spec to test implementations, compatibility, etc before finalizing it.
>>>>>> >
>>>>>> > As Fokko noted, we are currently concerned about the spec and not implementations. The purpose of implementation work before the spec is finalized is to reduce risk and build confidence that the spec is complete and correct. Once that’s done, it is important to finalize the changes. If we don’t finalize the changes, then implementations don’t know how/what to build and cannot plan when they will fully support v3, because it could change. Most of the work in other implementations will take place after the spec is adopted.
>>>>>> >
>>>>>> > Our process for building confidence in new spec versions is to update the spec with pending changes, implement them to validate (and clarify or adjust as needed), and vote to adopt the new version as a confirmation that we agree that the spec changes are reasonable and correct.
>>>>>> >
>>>>>> > We’ve already voted to accept the pending v3 changes into the spec, so the changes have already been in a candidate state for quite some time to work on implementations. Now we’re at the point where we’ve implemented the features and, in my opinion, have demonstrated the spec changes are correct and complete.
>>>>>> >
>>>>>> > To that end, the question I’m raising in this thread is “what areas and features need further validation?”
>>>>>> >
>>>>>> > I appreciate the ideas here (releasing will assist other implementations), but I don’t think that changes the question for this thread. The aim is to identify specific risks and blockers that we need to tackle before adopting the changes.
>>>>>> >
>>>>>> > [Russell] We should probably come to a resolution on the compressed metadata.json name as well, although that’s mostly retroactive. V3 would be the place where we could officially change the naming convention.
>>>>>> >
>>>>>> > I don’t think that this affects v3, but we should agree before moving on. The only part of the spec that would depend on this is the paths used by file system tables and that strategy is deprecated. We should only document it for clarity (we can’t change it) and I think we can do that any time.
>>>>>> >
>>>>>> > For the conventions used in catalog tables, I don’t think that we want to have requirements in the spec for file naming. We’ve avoided that in the past and it isn’t needed. It’s nice to have a convention in implementation notes, but there are other ways to handle this like magic bytes and catalog tracking.
>>>>>> >
>>>>>> > [Gang] it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value
>>>>>> >
>>>>>> > I think there is some confusion here, but Fokko may have already cleared it up.
>>>>>> >
>>>>>> > Right now, there are no multi-argument transforms in the spec. We have discussed adding a multi-argument bucket function, but there is not currently one in the spec. In order to minimize changes required for v3, we opted to update the spec to allow adding new transforms in a forward-compatible way between major spec versions (implementations must ignore unknown transforms).
>>>>>> >
>>>>>> > [Jia] We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository
>>>>>> >
>>>>>> > I agree that this is a good thing to clarify. We currently state that the ranges are [-180, 180] and [-90, 90] for geography, but we should state how points with NaN values are handled.
>>>>>> >
>>>>>> >
>>>>>> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>> >>
>>>>>> >> Hi Jia
>>>>>> >>
>>>>>> >> I feel it would be nice to get that Parquet spec clarification https://github.com/apache/parquet-format/pull/494 into the Iceberg V3 spec as well, once we finalize that.
>>>>>> >>
>>>>>> >> Thanks
>>>>>> >> Szehon
>>>>>> >>
>>>>>> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>>>>>> >>>
>>>>>> >>> Hi Szehon,
>>>>>> >>>
>>>>>> >>> Thanks for clarifying it.
>>>>>> >>>
>>>>>> >>> We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository. We’ve already concluded that the spec of Parquet (same on the Iceberg side I believe) only needs additional clarification to guide expected behavior: https://github.com/apache/parquet-format/pull/494
>>>>>> >>>
>>>>>> >>> BTW the Parquet Geo C++ PR has been merged today: https://github.com/apache/arrow/pull/45459 I believe the Parquet Geo Java PR is also very close.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> Jia
>>>>>> >>>
>>>>>> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>> >>>>
>>>>>> >>>> Hey Ryan,
>>>>>> >>>>
>>>>>> >>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>>>>> >>>>
>>>>>> >>>>> The v3 spec for multi-arg transform only advises to use `source-ids` instead of `source-id`. Although it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value.
>>>>>> >>>>
>>>>>> >>>> V3 now uses `source-ids` when there are multiple arguments and `source-id` when there is just one. PR can be found here.
>>>>>> >>>> This makes the serialization deterministic without knowing the format-version, simplifying the readers/writers. After some discussion on the PR, we've decided to leave out the multi-arg bucket transform so the V3 spec can be finalized. So V3 only contains the scaffolding for multi-arg transforms.
>>>>>> >>>>
>>>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
>>>>>> >>>>
>>>>>> >>>> I think it is a good idea to distinguish between the spec and the actual code. If we all feel comfortable with the spec, I think we could finalize it. Being comfortable also means that we know that we have a working implementation, but I don't think we have to wrap up all the loose ends before voting on the spec.
>>>>>> >>>>
>>>>>> >>>> At the PyIceberg side, we're also working to catch up on the V3 capabilities. Having a Java release that exposes these capabilities helps, so we can do round-trip validation.
>>>>>> >>>>
>>>>>> >>>> Kind regards,
>>>>>> >>>> Fokko
>>>>>> >>>>
>>>>>> >>>> On Wed, Apr 30, 2025 at 07:26 Jia Yu <ji...@apache.org> wrote:
>>>>>> >>>>>
>>>>>> >>>>> Hi folks,
>>>>>> >>>>>
>>>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
>>>>>> >>>>>
>>>>>> >>>>> Should a release with core updates include this PR?
>>>>>> >>>>>
>>>>>> >>>>> Thanks,
>>>>>> >>>>> Jia
>>>>>> >>>>>
>>>>>> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to test implementations, compatibility, etc before finalizing it.
>>>>>> >>>>>>
>>>>>> >>>>>> Thanks,
>>>>>> >>>>>> Manu
>>>>>> >>>>>>
>>>>>> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>> >>>>>>>
>>>>>> >>>>>>> Hi Ryan
>>>>>> >>>>>>>
>>>>>> >>>>>>> It sounds good.
>>>>>> >>>>>>>
>>>>>> >>>>>>> About multi-arg transforms, with the clarification we did a couple of weeks ago, I think we are good.
>>>>>> >>>>>>> Maybe a release with the core updated before announcing spec v3 officially would be a good idea ?
>>>>>> >>>>>>>
>>>>>> >>>>>>> Regards
>>>>>> >>>>>>> JB
>>>>>> >>>>>>>
>>>>>> >>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Hi everyone,
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> I think we’ve reached the point where it’s time to finalize and adopt the changes for Iceberg v3. We’ve been working toward this for the last few months and have now implemented the v3 features in the Java library to reduce the risk of needing changes or hitting problems (row lineage support in Spark 3.5 just went in!). We’ve also incorporated some clarifications and minor changes back into the spec from what we’ve learned.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> At this point, I’m confident that the spec is reasonable and correct. Thank you to everyone working on these reference implementations!
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> The next step is to discuss any outstanding items or concerns about moving forward, and then to have a vote thread to adopt the spec. I’ll start off with a couple of items:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet been finalized by the Parquet community, but we’ve built a full, independent implementation in Iceberg to validate the spec.
>>>>>> >>>>>>>> I think the Parquet community is primarily waiting on getting the PRs in to have a Java reference implementation, so the risk of changes to the Variant spec is small.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> There’s also an on-going vote to add encryption keys in support of full table encryption that I think we want to get in.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Any other items we may want to clear up?
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Ryan
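
As an illustration of Steven's point at the top of this thread (with no orphaned DVs, an accurate row count can come from manifest metadata alone), here is a minimal Python sketch. It is not the Iceberg API: `ManifestEntry`, its `content` labels, and `table_row_count` are hypothetical names made up for this example. It assumes each live DV entry's `record_count` is the number of rows that DV marks as deleted, and it ignores equality deletes and v2 position delete files.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical, simplified view of one live manifest entry (not the Iceberg API)."""
    content: str       # "data" for a data file, "dv" for a deletion vector
    record_count: int  # rows in the data file, or rows marked deleted by the DV


def table_row_count(live_entries: Iterable[ManifestEntry]) -> int:
    """Add data-file rows and subtract DV-deleted rows in a single pass.

    The result is only accurate if every DV references a data file that is
    still part of the table, i.e. there are no orphaned DVs.
    """
    total = 0
    for entry in live_entries:
        if entry.content == "data":
            total += entry.record_count
        elif entry.content == "dv":
            total -= entry.record_count
        else:
            raise ValueError(f"unknown content type: {entry.content!r}")
    return total


# Two data files (100 and 50 rows) and one DV that deletes 20 rows.
entries = [
    ManifestEntry("data", 100),
    ManifestEntry("data", 50),
    ManifestEntry("dv", 20),
]
print(table_row_count(entries))
```

If a data file were removed while its DV entry stayed in the delete manifest, this sum would undercount, which is why the proposed write-path constraint matters for this shortcut.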