Thanks for the clarification, Ryan. I'm aware of the major changes, but I find it hard to go through all the related descriptions, which are scattered all over the place.
Manu

On Tue, May 6, 2025 at 11:24 PM Ryan Blue <rdb...@gmail.com> wrote:

> Manu,
>
> We aren't currently voting. We are discussing any outstanding items to
> address before we close v3 to further changes and adopt the existing v3
> changes. Right now, the open item is to clarify NaN behavior in geometry
> and geography, PR #12956 <https://github.com/apache/iceberg/pull/12956>.
>
> Thanks for noting that the row lineage changes should be added to the
> appendix; I'll open a PR to add it. That appendix is an area to highlight
> things that have changed across versions, but an omission does not alter
> the requirements elsewhere in the spec. The changes we are discussing are
> the things that are noted as part of v3 in the spec. The major additions
> are new types, DVs, and row lineage.
>
> Ryan
>
> On Tue, May 6, 2025 at 3:32 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>
>> I'm wondering what changes we are voting for here. Is it everything
>> related to
>> https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities
>> from the table spec?
>> How about changes to other specs?
>>
>> Do we summarize all the changes in
>> https://iceberg.apache.org/spec/#appendix-e-format-version-changes? It
>> looks like row lineage is missing here.
>>
>> Thanks,
>> Manu
>>
>> On Tue, May 6, 2025 at 12:09 PM Anton Okolnychyi <aokolnyc...@gmail.com>
>> wrote:
>>
>>> DVs in Spark seem to behave reasonably, serving as a reference
>>> implementation of the V3 spec. There are areas for optimization/refinement
>>> but nothing was observed that requires changing the spec. I would also like
>>> to add the notion of content overhead/metadata (for Puffin/Parquet footers)
>>> to manifests to optimize DV maintenance. That said, it is optional
>>> information and can be added after finalizing V3.
>>>
>>> - Anton
>>>
>>> On Fri, May 2, 2025 at 11:23 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi Ryan
>>>>
>>>> All good for the spec.
>>>> The idea for the release is just to help "double check" that the spec
>>>> is good (we already saw some slight changes to the spec while working
>>>> on the release). I think we can be "confident" that we won't have
>>>> unexpected changes.
>>>>
>>>> Thanks!
>>>> Regards
>>>> JB
>>>>
>>>> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>> >
>>>> > Thanks, everyone! Looks like there are a few points to discuss.
>>>> >
>>>> > [JB] Maybe a release with the core updated before announcing spec v3
>>>> > officially would be a good idea?
>>>> > [Manu] Agree with Russell and JB that we make a “RC” release for V3
>>>> > spec to test implementations, compatibility, etc before finalizing it.
>>>> >
>>>> > As Fokko noted, we are currently concerned about the spec and not
>>>> > implementations. The reason is that implementation work done before
>>>> > the spec is finalized is meant to reduce risk and build confidence
>>>> > that the spec is complete and correct. Once that’s done, it is
>>>> > important to finalize the changes. If we don’t finalize the changes,
>>>> > then implementations don’t know how/what to build and cannot plan
>>>> > when they will fully support v3, because it could change. Most of the
>>>> > work in other implementations will take place after the spec is
>>>> > adopted.
>>>> >
>>>> > Our process for building confidence in new spec versions is to update
>>>> > the spec with pending changes, implement them to validate (and
>>>> > clarify or adjust as needed), and vote to adopt the new version as a
>>>> > confirmation that we agree that the spec changes are reasonable and
>>>> > correct.
>>>> >
>>>> > We’ve already voted to accept the pending v3 changes into the spec,
>>>> > so the changes have already been in a candidate state for quite some
>>>> > time to work on implementations. Now we’re at the point where we’ve
>>>> > implemented the features and, in my opinion, have demonstrated the
>>>> > spec changes are correct and complete.
>>>> >
>>>> > To that end, the question I’m raising in this thread is “what areas
>>>> > and features need further validation?”
>>>> >
>>>> > I appreciate the ideas here (releasing will assist other
>>>> > implementations), but I don’t think that changes the question for
>>>> > this thread. The aim is to identify specific risks and blockers that
>>>> > we need to tackle before adopting the changes.
>>>> >
>>>> > [Russell] We should probably come to a resolution on the compressed
>>>> > metadata.json name as well, although that’s mostly retroactive. V3
>>>> > would be the place where we could officially change the naming
>>>> > convention.
>>>> >
>>>> > I don’t think that this affects v3, but we should agree before moving
>>>> > on. The only part of the spec that would depend on this is the paths
>>>> > used by file system tables, and that strategy is deprecated. We
>>>> > should only document it for clarity (we can’t change it) and I think
>>>> > we can do that any time.
>>>> >
>>>> > For the conventions used in catalog tables, I don’t think that we
>>>> > want to have requirements in the spec for file naming. We’ve avoided
>>>> > that in the past and it isn’t needed. It’s nice to have a convention
>>>> > in implementation notes, but there are other ways to handle this,
>>>> > like magic bytes and catalog tracking.
>>>> >
>>>> > [Gang] it is implicit and obvious that only bucket transform can
>>>> > apply to multi-arg transform, it is still unclear the order of source
>>>> > columns and algorithm to use to calculate the bucket value
>>>> >
>>>> > I think there is some confusion here, but Fokko may have already
>>>> > cleared it up.
>>>> >
>>>> > Right now, there are no multi-argument transforms in the spec. We
>>>> > have discussed adding a multi-argument bucket function, but there is
>>>> > not currently one in the spec.
>>>> > In order to minimize changes required for v3, we opted to update the
>>>> > spec to allow adding new transforms in a forward-compatible way
>>>> > between major spec versions (implementations must ignore unknown
>>>> > transforms).
>>>> >
>>>> > [Jia] We’re currently addressing the handling of null/NaN values for
>>>> > X, Y, Z, and M coordinates in the Parquet format repository
>>>> >
>>>> > I agree that this is a good thing to clarify. We currently state that
>>>> > the ranges are [-180, 180] and [-90, 90] for geography, but we should
>>>> > state how points with NaN values are handled.
>>>> >
>>>> >
>>>> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> Hi Jia
>>>> >>
>>>> >> I feel it would be nice to get that Parquet spec clarification
>>>> >> https://github.com/apache/parquet-format/pull/494 into the Iceberg
>>>> >> V3 spec as well, once we finalize that.
>>>> >>
>>>> >> Thanks
>>>> >> Szehon
>>>> >>
>>>> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>>>> >>>
>>>> >>> Hi Szehon,
>>>> >>>
>>>> >>> Thanks for clarifying it.
>>>> >>>
>>>> >>> We’re currently addressing the handling of null/NaN values for X,
>>>> >>> Y, Z, and M coordinates in the Parquet format repository. We’ve
>>>> >>> already concluded that the spec of Parquet (same on the Iceberg
>>>> >>> side I believe) only needs additional clarification to guide
>>>> >>> expected behavior:
>>>> >>> https://github.com/apache/parquet-format/pull/494
>>>> >>>
>>>> >>> BTW the Parquet Geo C++ PR has been merged today:
>>>> >>> https://github.com/apache/arrow/pull/45459
>>>> >>> I believe the Parquet Geo Java PR is also very close.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Jia
>>>> >>>
>>>> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> Hey Ryan,
>>>> >>>>
>>>> >>>> Thanks for raising this, and I'm very excited to see V3 being
>>>> >>>> finalized!
>>>> >>>>
>>>> >>>>> The v3 spec for multi-arg transform only advises to use
>>>> >>>>> `source-ids` instead of `source-id`. Although it is implicit and
>>>> >>>>> obvious that only bucket transform can apply to multi-arg
>>>> >>>>> transform, it is still unclear the order of source columns and
>>>> >>>>> algorithm to use to calculate the bucket value.
>>>> >>>>
>>>> >>>> V3 now uses `source-ids` when there are multiple arguments and
>>>> >>>> `source-id` when there is just one. PR can be found here. This
>>>> >>>> makes the serialization deterministic without knowing the
>>>> >>>> format-version, simplifying the readers/writers. After some
>>>> >>>> discussion on the PR, we've decided to leave out the multi-arg
>>>> >>>> bucket transform so the V3 spec can be finalized. So V3 only
>>>> >>>> contains the scaffolding for multi-arg transforms.
>>>> >>>>
>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>>> >>>>> bounds and geospatial predicate to be merged:
>>>> >>>>> https://github.com/apache/iceberg/pull/12667
>>>> >>>>
>>>> >>>> I think it is a good idea to distinguish between the spec and the
>>>> >>>> actual code. If we all feel comfortable with the spec, I think we
>>>> >>>> could finalize it. Being comfortable also means that we know that
>>>> >>>> we have a working implementation, but I don't think we have to
>>>> >>>> wrap up all the loose ends before voting on the spec.
>>>> >>>>
>>>> >>>> On the PyIceberg side, we're also working to catch up on the V3
>>>> >>>> capabilities. Having a Java release that exposes these
>>>> >>>> capabilities helps, so we can do round-trip validation.
>>>> >>>>
>>>> >>>> Kind regards,
>>>> >>>> Fokko
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Apr 30, 2025 at 7:26 AM Jia Yu <ji...@apache.org> wrote:
>>>> >>>>>
>>>> >>>>> Hi folks,
>>>> >>>>>
>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>>> >>>>> bounds and geospatial predicate to be merged:
>>>> >>>>> https://github.com/apache/iceberg/pull/12667
>>>> >>>>>
>>>> >>>>> Should a release with core updates include this PR?
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>> Jia
>>>> >>>>>
>>>> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang
>>>> >>>>> <owenzhang1...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>> Agree with Russell and JB that we make a "RC" release for V3
>>>> >>>>>> spec to test implementations, compatibility, etc before
>>>> >>>>>> finalizing it.
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Manu
>>>> >>>>>>
>>>> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré
>>>> >>>>>> <j...@nanthrax.net> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Hi Ryan
>>>> >>>>>>>
>>>> >>>>>>> It sounds good.
>>>> >>>>>>>
>>>> >>>>>>> About multi-args transforms, with the clarification we did a
>>>> >>>>>>> couple of weeks ago, I think we are good.
>>>> >>>>>>> Maybe a release with the core updated before announcing spec
>>>> >>>>>>> v3 officially would be a good idea?
>>>> >>>>>>>
>>>> >>>>>>> Regards
>>>> >>>>>>> JB
>>>> >>>>>>>
>>>> >>>>>>> On Wed, Apr 30, 2025 at 12:35 AM, Ryan Blue <rdb...@gmail.com>
>>>> >>>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hi everyone,
>>>> >>>>>>>>
>>>> >>>>>>>> I think we’ve reached the point where it’s time to finalize
>>>> >>>>>>>> and adopt the changes for Iceberg v3. We’ve been working
>>>> >>>>>>>> toward this for the last few months and have now implemented
>>>> >>>>>>>> the v3 features in the Java library to reduce the risk of
>>>> >>>>>>>> needing changes or hitting problems (row lineage support in
>>>> >>>>>>>> Spark 3.5 just went in!). We’ve also incorporated some
>>>> >>>>>>>> clarifications and minor changes back into the spec from what
>>>> >>>>>>>> we’ve learned.
>>>> >>>>>>>>
>>>> >>>>>>>> At this point, I’m confident that the spec is reasonable and
>>>> >>>>>>>> correct. Thank you to everyone working on these reference
>>>> >>>>>>>> implementations!
>>>> >>>>>>>>
>>>> >>>>>>>> The next step is to discuss any outstanding items or concerns
>>>> >>>>>>>> about moving forward, and then to have a vote thread to adopt
>>>> >>>>>>>> the spec.
>>>> >>>>>>>> I’ll start off with a couple of items:
>>>> >>>>>>>>
>>>> >>>>>>>> One potential concern is that the upstream Variant spec hasn’t
>>>> >>>>>>>> yet been finalized by the Parquet community, but we’ve built a
>>>> >>>>>>>> full, independent implementation in Iceberg to validate the
>>>> >>>>>>>> spec. I think the Parquet community is primarily waiting on
>>>> >>>>>>>> getting the PRs in to have a Java reference implementation, so
>>>> >>>>>>>> the risk of changes to the Variant spec is small.
>>>> >>>>>>>>
>>>> >>>>>>>> There’s also an on-going vote to add encryption keys in
>>>> >>>>>>>> support of full table encryption that I think we want to get
>>>> >>>>>>>> in.
>>>> >>>>>>>>
>>>> >>>>>>>> Any other items we may want to clear up?
>>>> >>>>>>>>
>>>> >>>>>>>> Ryan
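
Two points from the thread lend themselves to a small illustration: Fokko's note that a partition field is written with `source-id` for one source column and `source-ids` for several, and Ryan's note that implementations must ignore unknown transforms. The Python below is only a rough sketch under those two statements. Every function name in it is hypothetical and not part of any Iceberg library, the list of known transform names is an assumption based on the transforms in the table spec, and `hypothetical_multi` is a made-up transform name, since the thread says there is currently no multi-argument transform in the spec.

```python
import re

# Assumed set of transform names from the Iceberg table spec; the helper
# names in this sketch are hypothetical, not an Iceberg API.
SIMPLE_TRANSFORMS = {"identity", "year", "month", "day", "hour", "void"}
PARAMETERIZED_TRANSFORM = re.compile(r"^(bucket|truncate)\[\d+\]$")


def partition_field_to_json(name, field_id, transform, source_ids):
    """Build a JSON-ready dict for one partition field.

    One source column is written under "source-id"; several source
    columns are written under "source-ids", as described in the thread.
    """
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")
    field = {"name": name, "field-id": field_id, "transform": transform}
    if len(source_ids) == 1:
        field["source-id"] = source_ids[0]
    else:
        field["source-ids"] = list(source_ids)
    return field


def source_ids_of(field):
    """Read the source column ids back, whichever key was used."""
    if "source-ids" in field:
        return list(field["source-ids"])
    return [field["source-id"]]


def is_known_transform(transform):
    """True if this sketch recognizes the transform name."""
    return (transform in SIMPLE_TRANSFORMS
            or PARAMETERIZED_TRANSFORM.match(transform) is not None)


def usable_fields(fields):
    """Keep fields with known transforms and skip (ignore) the rest."""
    return [f for f in fields if is_known_transform(f["transform"])]
```

For example, `partition_field_to_json("id_bucket", 1000, "bucket[16]", [1])` produces a dict with a `source-id` key, while passing `[1, 2]` with the made-up `hypothetical_multi` transform produces a `source-ids` key, and `usable_fields` would then drop that second field because its transform is not recognized. Because the key is chosen from the number of source columns alone, a reader does not need the format version to parse the field, which is the determinism Fokko describes.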