I'm wondering which changes we are voting on here. Is it everything under https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities in the table spec? What about changes to the other specs?
Do we summarize all the changes in https://iceberg.apache.org/spec/#appendix-e-format-version-changes? It looks like row lineage is missing there.

Thanks,
Manu

On Tue, May 6, 2025 at 12:09 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

> DVs in Spark seem to behave reasonably, serving as a reference implementation of the V3 spec. There are areas for optimization/refinement but nothing was observed that requires changing the spec. I would also like to add the notion of content overhead/metadata (for Puffin/Parquet footers) to manifests to optimize DV maintenance. That said, it is optional information and can be added after finalizing V3.
>
> - Anton
>
> On Fri, May 2, 2025 at 11:23 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi Ryan
>>
>> All good for the spec. The idea of a release is just to help "double check" that the spec is good (we already saw some slight changes to the spec while working on the release). I think we can be "confident" that we won't have unexpected changes.
>>
>> Thanks!
>> Regards
>> JB
>>
>> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
>> >
>> > Thanks, everyone! Looks like there are a few points to discuss.
>> >
>> > [JB] Maybe a release with the core updated before announcing spec v3 officially would be a good idea?
>> > [Manu] Agree with Russell and JB that we make a “RC” release for V3 spec to test implementations, compatibility, etc before finalizing it.
>> >
>> > As Fokko noted, we are currently concerned about the spec and not implementations. The purpose of implementation work before the spec is finalized is to reduce risk and build confidence that the spec is complete and correct. Once that’s done, it is important to finalize the changes. If we don’t finalize the changes, then implementations don’t know how/what to build and cannot plan when they will fully support v3, because it could change.
>> > Most of the work in other implementations will take place after the spec is adopted.
>> >
>> > Our process for building confidence in new spec versions is to update the spec with pending changes, implement them to validate (and clarify or adjust as needed), and vote to adopt the new version as a confirmation that we agree that the spec changes are reasonable and correct.
>> >
>> > We’ve already voted to accept the pending v3 changes into the spec, so the changes have already been in a candidate state for quite some time to work on implementations. Now we’re at the point where we’ve implemented the features and, in my opinion, have demonstrated the spec changes are correct and complete.
>> >
>> > To that end, the question I’m raising in this thread is “what areas and features need further validation?”
>> >
>> > I appreciate the ideas here (releasing will assist other implementations), but I don’t think that changes the question for this thread. The aim is to identify specific risks and blockers that we need to tackle before adopting the changes.
>> >
>> > [Russell] We should probably come to a resolution on the compressed metadata.json name as well, although that’s mostly retroactive. V3 would be the place where we could officially change the naming convention.
>> >
>> > I don’t think that this affects v3, but we should agree before moving on. The only part of the spec that would depend on this is the paths used by file system tables and that strategy is deprecated. We should only document it for clarity (we can’t change it) and I think we can do that any time.
>> >
>> > For the conventions used in catalog tables, I don’t think that we want to have requirements in the spec for file naming. We’ve avoided that in the past and it isn’t needed. It’s nice to have a convention in implementation notes, but there are other ways to handle this like magic bytes and catalog tracking.
>> >
>> > [Gang] it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value
>> >
>> > I think there is some confusion here, but Fokko may have already cleared it up.
>> >
>> > Right now, there are no multi-argument transforms in the spec. We have discussed adding a multi-argument bucket function, but there is not currently one in the spec. In order to minimize changes required for v3, we opted to update the spec to allow adding new transforms in a forward-compatible way between major spec versions (implementations must ignore unknown transforms).
>> >
>> > [Jia] We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository
>> >
>> > I agree that this is a good thing to clarify. We currently state that the ranges are [-180, 180] and [-90, 90] for geography, but we should state how points with NaN values are handled.
>> >
>> >
>> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>> >>
>> >> Hi Jia
>> >>
>> >> I feel it would be nice to get that Parquet spec clarification https://github.com/apache/parquet-format/pull/494 into the Iceberg V3 spec as well, once we finalize that.
>> >>
>> >> Thanks
>> >> Szehon
>> >>
>> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>> >>>
>> >>> Hi Szehon,
>> >>>
>> >>> Thanks for clarifying it.
>> >>>
>> >>> We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository.
>> >>> We’ve already concluded that the Parquet spec (same on the Iceberg side, I believe) only needs additional clarification to guide expected behavior: https://github.com/apache/parquet-format/pull/494
>> >>>
>> >>> BTW the Parquet Geo C++ PR has been merged today: https://github.com/apache/arrow/pull/45459 I believe the Parquet Geo Java PR is also very close.
>> >>>
>> >>> Thanks,
>> >>> Jia
>> >>>
>> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org> wrote:
>> >>>>
>> >>>> Hey Ryan,
>> >>>>
>> >>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>> >>>>
>> >>>>> The v3 spec for multi-arg transform only advises to use `source-ids` instead of `source-id`. Although it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value.
>> >>>>
>> >>>>
>> >>>> V3 now uses `source-ids` when there are multiple arguments and `source-id` when there is just one. PR can be found here. This makes the serialization deterministic without knowing the format-version, simplifying the readers/writers. After some discussion on the PR, we've decided to leave out the multi-arg bucket transform so the V3 spec can be finalized. So V3 only contains the scaffolding for multi-arg transforms.
>> >>>>
>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
>> >>>>
>> >>>>
>> >>>> I think it is a good idea to distinguish between the spec and the actual code. If we all feel comfortable with the spec, I think we could finalize it. Being comfortable also means that we know that we have a working implementation, but I don't think we have to wrap up all the loose ends before voting on the spec.
>> >>>>
>> >>>> On the PyIceberg side, we're also working to catch up on the V3 capabilities. Having a Java release that exposes these capabilities helps, so we can do round-trip validation.
>> >>>>
>> >>>> Kind regards,
>> >>>> Fokko
>> >>>>
>> >>>>
>> >>>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
>> >>>>>
>> >>>>> Hi folks,
>> >>>>>
>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
>> >>>>>
>> >>>>> Should a release with core updates include this PR?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Jia
>> >>>>>
>> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to test implementations, compatibility, etc before finalizing it.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Manu
>> >>>>>>
>> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >>>>>>>
>> >>>>>>> Hi Ryan
>> >>>>>>>
>> >>>>>>> It sounds good.
>> >>>>>>>
>> >>>>>>> About multi-args transforms, with the clarification we did a couple of weeks ago, I think we are good.
>> >>>>>>> Maybe a release with the core updated before announcing spec v3 officially would be a good idea?
>> >>>>>>>
>> >>>>>>> Regards
>> >>>>>>> JB
>> >>>>>>>
>> >>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi everyone,
>> >>>>>>>>
>> >>>>>>>> I think we’ve reached the point where it’s time to finalize and adopt the changes for Iceberg v3. We’ve been working toward this for the last few months and have now implemented the v3 features in the Java library to reduce the risk of needing changes or hitting problems (row lineage support in Spark 3.5 just went in!). We’ve also incorporated some clarifications and minor changes back into the spec from what we’ve learned.
>> >>>>>>>>
>> >>>>>>>> At this point, I’m confident that the spec is reasonable and correct. Thank you to everyone working on these reference implementations!
>> >>>>>>>>
>> >>>>>>>> The next step is to discuss any outstanding items or concerns about moving forward, and then to have a vote thread to adopt the spec. I’ll start off with a couple of items:
>> >>>>>>>>
>> >>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet been finalized by the Parquet community, but we’ve built a full, independent implementation in Iceberg to validate the spec. I think the Parquet community is primarily waiting on getting the PRs in to have a Java reference implementation, so the risk of changes to the Variant spec is small.
>> >>>>>>>>
>> >>>>>>>> There’s also an ongoing vote to add encryption keys in support of full table encryption that I think we want to get in.
>> >>>>>>>>
>> >>>>>>>> Any other items we may want to clear up?
>> >>>>>>>>
>> >>>>>>>> Ryan