Hi Jia,

I feel it would be nice to get that Parquet spec clarification
https://github.com/apache/parquet-format/pull/494 into the Iceberg V3 spec as well, once we finalize that.
Thanks,
Szehon

On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:

> Hi Szehon,
>
> Thanks for clarifying it.
>
> We’re currently addressing the handling of null/NaN values for X, Y, Z,
> and M coordinates in the Parquet format repository. We’ve already
> concluded that the spec of Parquet (same on the Iceberg side I believe)
> only needs additional clarification to guide expected behavior:
> https://github.com/apache/parquet-format/pull/494
>
> BTW the Parquet Geo C++ PR has been merged today:
> https://github.com/apache/arrow/pull/45459
> I believe the Parquet Geo Java PR is also very close.
>
> Thanks,
> Jia
>
> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org>
> wrote:
>
>> Hey Ryan,
>>
>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>
>>> The v3 spec for multi-arg transform only advises to use `source-ids`
>>> instead of `source-id`. Although it is implicit and obvious that only
>>> bucket transform can apply to multi-arg transform, it is still unclear
>>> the order of source columns and algorithm to use to calculate the
>>> bucket value.
>>
>> V3 now uses `source-ids` when there are multiple arguments and
>> `source-id` when there is just one. PR can be found here
>> <https://github.com/apache/iceberg/pull/12644>. This makes the
>> serialization deterministic without knowing the format-version,
>> simplifying the readers/writers. After some discussion on the PR, we've
>> decided to leave out the multi-arg bucket transform so the V3 spec can
>> be finalized. So V3 only contains the scaffolding for multi-arg
>> transforms.
>>
>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds
>>> and geospatial predicate to be merged:
>>> https://github.com/apache/iceberg/pull/12667
>>
>> I think it is a good idea to distinguish between the spec and the actual
>> code. If we all feel comfortable with the spec, I think we could
>> finalize it.
>> Being comfortable also means that we know that we have a working
>> implementation, but I don't think we have to wrap up all the loose ends
>> before voting on the spec.
>>
>> On the PyIceberg side, we're also working to catch up on the V3
>> capabilities <https://github.com/apache/iceberg-python/issues/1818>.
>> Having a Java release that exposes these capabilities helps, so we can
>> do round-trip validation.
>>
>> Kind regards,
>> Fokko
>>
>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
>>
>>> Hi folks,
>>>
>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds
>>> and geospatial predicate to be merged:
>>> https://github.com/apache/iceberg/pull/12667
>>>
>>> Should a release with core updates include this PR?
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com>
>>> wrote:
>>>
>>>> Agree with Russell and JB that we make an "RC" release for the V3
>>>> spec to test implementations, compatibility, etc. before finalizing
>>>> it.
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré
>>>> <j...@nanthrax.net> wrote:
>>>>
>>>>> Hi Ryan
>>>>>
>>>>> It sounds good.
>>>>>
>>>>> About multi-args transforms, with the clarification we did a couple
>>>>> of weeks ago, I think we are good.
>>>>> Maybe a release with the core updated before announcing spec v3
>>>>> officially would be a good idea?
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I think we’ve reached the point where it’s time to finalize and
>>>>>> adopt the changes for Iceberg v3. We’ve been working toward this
>>>>>> for the last few months and have now implemented the v3 features in
>>>>>> the Java library to reduce the risk of needing changes or hitting
>>>>>> problems (row lineage support in Spark 3.5 just went in!).
>>>>>> We’ve also incorporated some clarifications and minor changes back
>>>>>> into the spec from what we’ve learned.
>>>>>>
>>>>>> At this point, I’m confident that the spec is reasonable and
>>>>>> correct. Thank you to everyone working on these reference
>>>>>> implementations!
>>>>>>
>>>>>> The next step is to discuss any outstanding items or concerns about
>>>>>> moving forward, and then to have a vote thread to adopt the spec.
>>>>>> I’ll start off with a couple of items:
>>>>>>
>>>>>> One potential concern is that the upstream Variant spec hasn’t yet
>>>>>> been finalized by the Parquet community, but we’ve built a full,
>>>>>> independent implementation in Iceberg to validate the spec. I think
>>>>>> the Parquet community is primarily waiting on getting the PRs in to
>>>>>> have a Java reference implementation, so the risk of changes to the
>>>>>> Variant spec is small.
>>>>>>
>>>>>> There’s also an on-going vote to add encryption keys in support of
>>>>>> full table encryption that I think we want to get in.
>>>>>>
>>>>>> Any other items we may want to clear up?
>>>>>>
>>>>>> Ryan