Hi Ryan

All good for the spec. The idea of a release is just to help "double check" that the spec is good (we already saw some slight changes to the spec while working on the release). I think we can be "confident" that we won't have unexpected changes.

Thanks!

Regards
JB

On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
>
> Thanks, everyone! Looks like there are a few points to discuss.
>
> [JB] Maybe a release with the core updated before announcing spec v3
> officially would be a good idea?
> [Manu] Agree with Russell and JB that we make a “RC” release for V3 spec to
> test implementations, compatibility, etc before finalizing it.
>
> As Fokko noted, we are currently concerned about the spec and not
> implementations. The reason for implementation work before the spec is
> finalized is to reduce risk and build confidence that the spec is complete
> and correct. Once that’s done, it is important to finalize the changes. If we
> don’t finalize the changes, then implementations don’t know how/what to build
> and cannot plan when they will fully support v3, because it could change.
> Most of the work in other implementations will take place after the spec is
> adopted.
>
> Our process for building confidence in new spec versions is to update the
> spec with pending changes, implement them to validate (and clarify or adjust
> as needed), and vote to adopt the new version as a confirmation that we agree
> that the spec changes are reasonable and correct.
>
> We’ve already voted to accept the pending v3 changes into the spec, so the
> changes have already been in a candidate state for quite some time to work on
> implementations. Now we’re at the point where we’ve implemented the features
> and, in my opinion, have demonstrated the spec changes are correct and
> complete.
>
> To that end, the question I’m raising in this thread is “what areas and
> features need further validation?”
>
> I appreciate the ideas here (releasing will assist other implementations),
> but I don’t think that changes the question for this thread. The aim is to
> identify specific risks and blockers that we need to tackle before adopting
> the changes.
>
> [Russell] We should probably come to a resolution on the compressed
> metadata.json name as well, although that’s mostly retroactive. V3 would be
> the place where we could officially change the naming convention.
>
> I don’t think that this affects v3, but we should agree before moving on. The
> only part of the spec that would depend on this is the paths used by file
> system tables and that strategy is deprecated. We should only document for
> clarity (we can’t change it) and I think we can do that any time.
>
> For the conventions used in catalog tables, I don’t think that we want to
> have requirements in the spec for file naming. We’ve avoided that in the past
> and it isn’t needed. It’s nice to have a convention in implementation notes,
> but there are other ways to handle this like magic bytes and catalog tracking.
>
> [Gang] it is implicit and obvious that only bucket transform can apply to
> multi-arg transform, it is still unclear the order of source columns and
> algorithm to use to calculate the bucket value
>
> I think there is some confusion here, but Fokko may have already cleared it
> up.
>
> Right now, there are no multi-argument transforms in the spec. We have
> discussed adding a multi-argument bucket function, but there is not currently
> one in the spec. In order to minimize changes required for v3, we opted to
> update the spec to allow adding new transforms in a forward-compatible way
> between major spec versions (implementations must ignore unknown transforms).
>
> [Jia] We’re currently addressing the handling of null/NaN values for X, Y, Z,
> and M coordinates in the Parquet format repository
>
> I agree that this is a good thing to clarify. We currently state that the
> ranges are [-180, 180] and [-90, 90] for geography, but we should state how
> points with NaN values are handled.
>
>
> On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>
>> Hi Jia
>>
>> I feel it would be nice to get that Parquet spec clarification
>> https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as
>> well, once we finalize that.
>>
>> Thanks
>> Szehon
>>
>> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>>>
>>> Hi Szehon,
>>>
>>> Thanks for clarifying it.
>>>
>>> We’re currently addressing the handling of null/NaN values for X, Y, Z, and
>>> M coordinates in the Parquet format repository. We’ve already concluded
>>> that the spec of Parquet (same on the Iceberg side I believe) only needs
>>> additional clarification to guide expected behavior:
>>> https://github.com/apache/parquet-format/pull/494
>>>
>>> BTW the Parquet Geo C++ PR has been merged today:
>>> https://github.com/apache/arrow/pull/45459 I believe the Parquet Geo Java
>>> PR is also very close.
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>
>>>> Hey Ryan,
>>>>
>>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>>>
>>>>> The v3 spec for multi-arg transform only advises to use `source-ids`
>>>>> instead of `source-id`. Although it is implicit and obvious that only
>>>>> bucket transform can apply to multi-arg transform, it is still unclear
>>>>> the order of source columns and algorithm to use to calculate the bucket
>>>>> value.
>>>>
>>>>
>>>> V3 now uses `source-ids` when there are multiple arguments and `source-id`
>>>> when there is just one. PR can be found here. This makes the serialization
>>>> deterministic without knowing the format-version, simplifying the
>>>> readers/writers. After some discussion on the PR, we've decided to leave
>>>> out the multi-arg bucket transform so the V3 spec can be finalized. So V3
>>>> only contains the scaffolding for multi-arg transforms.
>>>>
>>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and
>>>>> geospatial predicate to be merged:
>>>>> https://github.com/apache/iceberg/pull/12667
>>>>
>>>>
>>>> I think it is a good idea to distinguish between the spec and the actual
>>>> code. If we all feel comfortable with the spec, I think we could finalize
>>>> it. Being comfortable also means that we know that we have a working
>>>> implementation, but I don't think we have to wrap up all the loose ends
>>>> before voting on the spec.
>>>>
>>>> At the PyIceberg side, we're also working to catch up on the V3
>>>> capabilities. Having a Java release that exposes these capabilities helps,
>>>> so we can do round-trip validation.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>>
>>>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and
>>>>> geospatial predicate to be merged:
>>>>> https://github.com/apache/iceberg/pull/12667
>>>>>
>>>>> Should a release with core updates include this PR?
>>>>>
>>>>> Thanks,
>>>>> Jia
>>>>>
>>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>
>>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to
>>>>>> test implementations, compatibility, etc before finalizing it.
>>>>>>
>>>>>> Thanks,
>>>>>> Manu
>>>>>>
>>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>
>>>>>>> Hi Ryan
>>>>>>>
>>>>>>> It sounds good.
>>>>>>>
>>>>>>> About multi-args transforms, with the clarification we did a couple of
>>>>>>> weeks ago, I think we are good.
>>>>>>> Maybe a release with the core updated before announcing spec v3
>>>>>>> officially would be a good idea?
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I think we’ve reached the point where it’s time to finalize and adopt
>>>>>>>> the changes for Iceberg v3. We’ve been working toward this for the
>>>>>>>> last few months and have now implemented the v3 features in the Java
>>>>>>>> library to reduce the risk of needing changes or hitting problems (row
>>>>>>>> lineage support in Spark 3.5 just went in!). We’ve also incorporated
>>>>>>>> some clarifications and minor changes back into the spec from what
>>>>>>>> we’ve learned.
>>>>>>>>
>>>>>>>> At this point, I’m confident that the spec is reasonable and correct.
>>>>>>>> Thank you to everyone working on these reference implementations!
>>>>>>>>
>>>>>>>> The next step is to discuss any outstanding items or concerns about
>>>>>>>> moving forward, and then to have a vote thread to adopt the spec. I’ll
>>>>>>>> start off with a couple of items:
>>>>>>>>
>>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet
>>>>>>>> been finalized by the Parquet community, but we’ve built a full,
>>>>>>>> independent implementation in Iceberg to validate the spec. I think
>>>>>>>> the Parquet community is primarily waiting on getting the PRs in to
>>>>>>>> have a Java reference implementation, so the risk of changes to the
>>>>>>>> Variant spec is small.
>>>>>>>>
>>>>>>>> There’s also an ongoing vote to add encryption keys in support of
>>>>>>>> full table encryption that I think we want to get in.
>>>>>>>>
>>>>>>>> Any other items we may want to clear up?
>>>>>>>>
>>>>>>>> Ryan