Sounds good to me. I think we can move ahead with this. For all intents and purposes, we are past any breaking changes for Spec V3 and should consider it "stable" for implementation purposes. I want to work on some official descriptions of our spec versioning / library process to better explain this to outside users, but that can happen independently.

On Thu, May 1, 2025 at 12:05 PM Ryan Blue <rdb...@gmail.com> wrote:

> Thanks, everyone! Looks like there are a few points to discuss.
>
> [JB] Maybe a release with the core updated before announcing spec v3 officially would be a good idea?
> [Manu] Agree with Russell and JB that we make a “RC” release for V3 spec to test implementations, compatibility, etc before finalizing it.
>
> As Fokko noted, we are currently concerned about the spec and not implementations. The purpose of implementation work before the spec is finalized is to reduce risk and build confidence that the spec is complete and correct. Once that’s done, it is important to finalize the changes. If we don’t finalize the changes, then implementations don’t know how/what to build and cannot plan when they will fully support v3, because it could change. Most of the work in other implementations will take place after the spec is adopted.
>
> Our process for building confidence in new spec versions is to update the spec with pending changes, implement them to validate (and clarify or adjust as needed), and vote to adopt the new version as a confirmation that we agree that the spec changes are reasonable and correct.
>
> We’ve already voted to accept the pending v3 changes into the spec, so the changes have already been in a candidate state for quite some time to work on implementations. Now we’re at the point where we’ve implemented the features and, in my opinion, have demonstrated the spec changes are correct and complete.
>
> To that end, the question I’m raising in this thread is *“what areas and features need further validation?”*
>
> I appreciate the ideas here (releasing will assist other implementations), but I don’t think that changes the question for this thread. The aim is to identify specific risks and blockers that we need to tackle before adopting the changes.
>
> [Russell] We should probably come to a resolution on the compressed metadata.json name as well, although that’s mostly retroactive. V3 would be the place where we could officially change the naming convention.
>
> I don’t think that this affects v3, but we should agree before moving on. The only part of the spec that would depend on this is the paths used by file system tables, and that strategy is deprecated. We should only document it for clarity (we can’t change it) and I think we can do that any time.
>
> For the conventions used in catalog tables, I don’t think that we want to have requirements in the spec for file naming. We’ve avoided that in the past and it isn’t needed. It’s nice to have a convention in implementation notes, but there are other ways to handle this, like magic bytes and catalog tracking.
>
> [Gang] it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value
>
> I think there is some confusion here, but Fokko may have already cleared it up.
>
> Right now, there are no multi-argument transforms in the spec. We have discussed adding a multi-argument bucket function, but there is not currently one in the spec. In order to minimize changes required for v3, we opted to update the spec to allow adding new transforms in a forward-compatible way between major spec versions (implementations must ignore unknown transforms).
>
> [Jia] We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository
>
> I agree that this is a good thing to clarify. We currently state that the ranges are [-180, 180] and [-90, 90] for geography, but we should state how points with NaN values are handled.
>
> On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Hi Jia
>>
>> I feel it would be nice to get that Parquet spec clarification https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as well, once we finalize that.
>>
>> Thanks
>> Szehon
>>
>> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>>
>>> Hi Szehon,
>>>
>>> Thanks for clarifying it.
>>>
>>> We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository. We’ve already concluded that the spec of Parquet (same on the Iceberg side I believe) only needs additional clarification to guide expected behavior: https://github.com/apache/parquet-format/pull/494
>>>
>>> BTW the Parquet Geo C++ PR has been merged today: https://github.com/apache/arrow/pull/45459 I believe the Parquet Geo Java PR is also very close.
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>
>>>> Hey Ryan,
>>>>
>>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>>>
>>>>> The v3 spec for multi-arg transform only advises to use `source-ids` instead of `source-id`. Although it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value.
>>>>
>>>> V3 now uses `source-ids` when there are multiple arguments and `source-id` when there is just one. PR can be found here <https://github.com/apache/iceberg/pull/12644>. This makes the serialization deterministic without knowing the format-version, simplifying the readers/writers. After some discussion on the PR, we've decided to leave out the multi-arg bucket transform so the V3 spec can be finalized. So V3 only contains the scaffolding for multi-arg transforms.
>>>>
>>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
>>>>
>>>> I think it is a good idea to distinguish between the spec and the actual code. If we all feel comfortable with the spec, I think we could finalize it. Being comfortable also means that we know that we have a working implementation, but I don't think we have to wrap up all the loose ends before voting on the spec.
>>>>
>>>> On the PyIceberg side, we're also working to catch up on the V3 capabilities <https://github.com/apache/iceberg-python/issues/1818>. Having a Java release that exposes these capabilities helps, so we can do round-trip validation.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
>>>>>
>>>>> Should a release with core updates include this PR?
>>>>>
>>>>> Thanks,
>>>>> Jia
>>>>>
>>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>
>>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to test implementations, compatibility, etc before finalizing it.
>>>>>>
>>>>>> Thanks,
>>>>>> Manu
>>>>>>
>>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> Hi Ryan
>>>>>>>
>>>>>>> It sounds good.
>>>>>>>
>>>>>>> About multi-args transforms, with the clarification we did a couple of weeks ago, I think we are good.
>>>>>>> Maybe a release with the core updated before announcing spec v3 officially would be a good idea?
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I think we’ve reached the point where it’s time to finalize and adopt the changes for Iceberg v3. We’ve been working toward this for the last few months and have now implemented the v3 features in the Java library to reduce the risk of needing changes or hitting problems (row lineage support in Spark 3.5 just went in!). We’ve also incorporated some clarifications and minor changes back into the spec from what we’ve learned.
>>>>>>>>
>>>>>>>> At this point, I’m confident that the spec is reasonable and correct. Thank you to everyone working on these reference implementations!
>>>>>>>>
>>>>>>>> The next step is to discuss any outstanding items or concerns about moving forward, and then to have a vote thread to adopt the spec. I’ll start off with a couple of items:
>>>>>>>>
>>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet been finalized by the Parquet community, but we’ve built a full, independent implementation in Iceberg to validate the spec. I think the Parquet community is primarily waiting on getting the PRs in to have a Java reference implementation, so the risk of changes to the Variant spec is small.
>>>>>>>>
>>>>>>>> There’s also an ongoing vote to add encryption keys in support of full table encryption that I think we want to get in.
>>>>>>>>
>>>>>>>> Any other items we may want to clear up?
>>>>>>>>
>>>>>>>> Ryan
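
For anyone following the `source-id` / `source-ids` point in Fokko's message above, here is a minimal sketch of that rule as the thread describes it: a partition field with one source column is written with `source-id`, one with several is written with `source-ids`, and a reader can accept either without looking at the format version. This is an illustration only, not code from any Iceberg library. The function names and the `hypothetical_multi` transform are made up, and the other JSON keys (`name`, `transform`, `field-id`) are assumed to match the partition-field JSON in the spec.

```python
from typing import Any


def write_partition_field(
    name: str, transform: str, field_id: int, source_ids: list[int]
) -> dict[str, Any]:
    """Build the JSON-like dict for one partition field (hypothetical helper)."""
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")
    field: dict[str, Any] = {
        "name": name,
        "transform": transform,
        "field-id": field_id,
    }
    if len(source_ids) == 1:
        # Single-argument transform: always written as `source-id`.
        field["source-id"] = source_ids[0]
    else:
        # Multi-argument transform: written as `source-ids`.
        field["source-ids"] = list(source_ids)
    return field


def read_source_ids(field: dict[str, Any]) -> list[int]:
    """Return the source column IDs regardless of which key was written."""
    if "source-ids" in field:
        return list(field["source-ids"])
    if "source-id" in field:
        return [field["source-id"]]
    raise ValueError("partition field has neither source-id nor source-ids")


single = write_partition_field("id_bucket", "bucket[16]", 1000, [1])
multi = write_partition_field("multi", "hypothetical_multi", 1001, [1, 2])
print(single)
print(multi)
```

Because the key is chosen only from the number of source columns, the same writer and reader logic works for every format version, which is the "deterministic without knowing the format-version" property mentioned in the thread.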