Re: [DISCUSS] Finalizing the v3 spec

Manu Zhang Tue, 06 May 2025 09:18:22 -0700

Thanks for the clarification, Ryan.

I'm aware of the major changes, but I find it hard to go through all the
related descriptions which are scattered all over the place.


Manu

On Tue, May 6, 2025 at 11:24 PM Ryan Blue <rdb...@gmail.com> wrote:

> Manu,
>
> We aren't currently voting. We are discussing any outstanding items to
> address before we close v3 to further changes and adopt the existing v3
> changes. Right now, the open item is to clarify NaN behavior in geometry
> and geography, PR #12956 <https://github.com/apache/iceberg/pull/12956>.
>
> Thanks for noting that the row lineage changes should be added to the
> appendix; I'll open a PR to add them. That appendix is an area to highlight
> things that have changed across versions, but an omission does not alter
> the requirements elsewhere in the spec. The changes we are discussing are the
> things that are noted as part of v3 in the spec. The major additions are
> new types, DVs, and row lineage.
>
> Ryan
>
> On Tue, May 6, 2025 at 3:32 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>
>> I'm wondering what changes we are voting for here. Is it everything
>> related to
>> https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities 
>> from
>> the table spec?
>> How about changes to other specs?
>>
>> Do we summarize all the changes in
>> https://iceberg.apache.org/spec/#appendix-e-format-version-changes? It
>> looks like row lineage is missing here.
>>
>> Thanks,
>> Manu
>>
>> On Tue, May 6, 2025 at 12:09 PM Anton Okolnychyi <aokolnyc...@gmail.com>
>> wrote:
>>
>>> DVs in Spark seem to behave reasonably, serving as a reference
>>> implementation of the V3 spec. There are areas for optimization/refinement
>>> but nothing was observed that requires changing the spec. I would also like
>>> to add the notion of content overhead/metadata (for Puffin/Parquet footers)
>>> to manifests to optimize DV maintenance. That said, it is optional
>>> information and can be added after finalizing V3.
>>>
>>> - Anton
>>>
>>> On Fri, May 2, 2025 at 11:23 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>
>>>> Hi Ryan
>>>>
>>>> All good for the spec. The idea of a release is just to help "double
>>>> check" that the spec is good (we already saw some slight changes to the
>>>> spec while working on the release). I think we can be "confident" that we
>>>> won't have unexpected changes.
>>>>
>>>> Thanks !
>>>> Regards
>>>> JB
>>>>
>>>> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>> >
>>>> > Thanks, everyone! Looks like there are a few points to discuss.
>>>> >
>>>> > [JB] Maybe a release with the core updated before announcing spec v3
>>>> officially would be a good idea ?
>>>> > [Manu] Agree with Russell and JB that we make a “RC” release for V3
>>>> spec to test implementations, compatibility, etc before finalizing it.
>>>> >
>>>> > As Fokko noted, we are currently concerned about the spec and not
>>>> implementations. The reason is that implementation work before the spec is
>>>> finalized is to reduce risk and build confidence that the spec is complete
>>>> and correct. Once that’s done, it is important to finalize the changes. If
>>>> we don’t finalize the changes, then implementations don’t know how/what
>>>> to build and cannot plan when they will fully support v3, because it could
>>>> change. Most of the work in other implementations will take place after the
>>>> spec is adopted.
>>>> >
>>>> > Our process for building confidence in new spec versions is to update
>>>> the spec with pending changes, implement them to validate (and clarify or
>>>> adjust as needed), and vote to adopt the new version as a confirmation that
>>>> we agree that the spec changes are reasonable and correct.
>>>> >
>>>> > We’ve already voted to accept the pending v3 changes into the spec,
>>>> so the changes have already been in a candidate state for quite some time
>>>> to work on implementations. Now we’re at the point where we’ve implemented
>>>> the features and, in my opinion, have demonstrated the spec changes are
>>>> correct and complete.
>>>> >
>>>> > To that end, the question I’m raising in this thread is “what areas
>>>> and features need further validation?”
>>>> >
>>>> > I appreciate the ideas here — releasing will assist other
>>>> implementations — but I don’t think that changes the question for this
>>>> thread. The aim is to identify specific risks and blockers that we need to
>>>> tackle before adopting the changes.
>>>> >
>>>> > [Russell] We should probably come to a resolution on the compressed
>>>> metadata.json name as well, although that’s mostly retroactive. V3 would be
>>>> the place where we could officially change the naming convention.
>>>> >
>>>> > I don’t think that this affects v3, but we should agree before moving
>>>> on. The only part of the spec that would depend on this is the paths used
>>>> by file system tables and that strategy is deprecated. We should only
>>>> document it for clarity (we can’t change it) and I think we can do that any
>>>> time.
>>>> >
>>>> > For the conventions used in catalog tables, I don’t think that we
>>>> want to have requirements in the spec for file naming. We’ve avoided that
>>>> in the past and it isn’t needed. It’s nice to have a convention in
>>>> implementation notes, but there are other ways to handle this like magic
>>>> bytes and catalog tracking.
>>>> >
>>>> > [Gang] it is implicit and obvious that only bucket transform can
>>>> apply to multi-arg transform, it is still unclear the order of source
>>>> columns and algorithm to use to calculate the bucket value
>>>> >
>>>> > I think there is some confusion here, but Fokko may have already
>>>> cleared it up.
>>>> >
>>>> > Right now, there are no multi-argument transforms in the spec. We
>>>> have discussed adding a multi-argument bucket function, but there is not
>>>> currently one in the spec. In order to minimize changes required for v3, we
>>>> opted to update the spec to allow adding new transforms in a
>>>> forward-compatible way between major spec versions (implementations must
>>>> ignore unknown transforms).
>>>> >
>>>> > [Jia] We’re currently addressing the handling of null/NaN values for
>>>> X, Y, Z, and M coordinates in the Parquet format repository
>>>> >
>>>> > I agree that this is a good thing to clarify. We currently state that
>>>> the ranges are [-180, 180] and [-90, 90] for geography, but we should state
>>>> how points with NaN values are handled.
>>>> >
>>>> >
>>>> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Hi Jia
>>>> >>
>>>> >> I feel it would be nice to get that Parquet spec clarification
>>>> https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec
>>>> as well, once we finalize that.
>>>> >>
>>>> >> Thanks
>>>> >> Szehon
>>>> >>
>>>> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>>>> >>>
>>>> >>> Hi Szehon,
>>>> >>>
>>>> >>> Thanks for clarifying it.
>>>> >>>
>>>> >>> We’re currently addressing the handling of null/NaN values for X,
>>>> Y, Z, and M coordinates in the Parquet format repository. We’ve already
>>>> concluded that the spec of Parquet (same on the Iceberg side I believe)
>>>> only needs additional clarification to guide expected behavior:
>>>> https://github.com/apache/parquet-format/pull/494
>>>> >>>
>>>> >>> BTW the Parquet Geo C++ PR has been merged today:
>>>> https://github.com/apache/arrow/pull/45459  I believe the Parquet Geo
>>>> Java PR is also very close.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Jia
>>>> >>>
>>>> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>> >>>>
>>>> >>>> Hey Ryan,
>>>> >>>>
>>>> >>>> Thanks for raising this, and I'm very excited to see V3 being
>>>> finalized!
>>>> >>>>
>>>> >>>>> The v3 spec for multi-arg transform only advises to use
>>>> `source-ids` instead of `source-id`. Although it is implicit and obvious
>>>> that only bucket transform can apply to multi-arg transform, it is still
>>>> unclear the order of source columns and algorithm to use to calculate the
>>>> bucket value.
>>>> >>>>
>>>> >>>>
>>>> >>>> V3 now uses `source-ids` when there are multiple arguments and
>>>> `source-id` when there is just one. PR can be found here. This makes the
>>>> serialization deterministic without knowing the format-version, simplifying
>>>> the readers/writers. After some discussion on the PR, we've decided to
>>>> leave out the multi-arg bucket transform so the V3 spec can be finalized.
>>>> So V3 only contains the scaffolding for multi-arg transforms.
>>>> >>>>
>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>>> bounds and geospatial predicate to be merged:
>>>> https://github.com/apache/iceberg/pull/12667
>>>> >>>>
>>>> >>>>
>>>> >>>> I think it is a good idea to distinguish between the spec and the
>>>> actual code. If we all feel comfortable with the spec, I think we could
>>>> finalize it. Being comfortable also means that we know that we have a
>>>> working implementation, but I don't think we have to wrap up all the loose
>>>> ends before voting on the spec.
>>>> >>>>
>>>> >>>> On the PyIceberg side, we're also working to catch up on the V3
>>>> capabilities. Having a Java release that exposes these capabilities helps,
>>>> so we can do round-trip validation.
>>>> >>>>
>>>> >>>> Kind regards,
>>>> >>>> Fokko
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
>>>> >>>>>
>>>> >>>>> Hi folks,
>>>> >>>>>
>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>>> bounds and geospatial predicate to be merged:
>>>> https://github.com/apache/iceberg/pull/12667
>>>> >>>>>
>>>> >>>>> Should a release with core updates include this PR?
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>> Jia
>>>> >>>>>
>>>> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <
>>>> owenzhang1...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>> Agree with Russell and JB that we make a "RC" release for V3
>>>> spec to test implementations, compatibility, etc before finalizing it.
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Manu
>>>> >>>>>>
>>>> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <
>>>> j...@nanthrax.net> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Hi Ryan
>>>> >>>>>>>
>>>> >>>>>>> It sounds good.
>>>> >>>>>>>
>>>> >>>>>>> About multi-arg transforms, with the clarification we did a
>>>> couple of weeks ago, I think we are good.
>>>> >>>>>>> Maybe a release with the core updated before announcing spec v3
>>>> officially would be a good idea ?
>>>> >>>>>>>
>>>> >>>>>>> Regards
>>>> >>>>>>> JB
>>>> >>>>>>>
>>>> >>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hi everyone,
>>>> >>>>>>>>
>>>> >>>>>>>> I think we’ve reached the point where it’s time to finalize
>>>> and adopt the changes for Iceberg v3. We’ve been working toward this for
>>>> the last few months and have now implemented the v3 features in the Java
>>>> library to reduce the risk of needing changes or hitting problems (row
>>>> lineage support in Spark 3.5 just went in!). We’ve also incorporated some
>>>> clarifications and minor changes back into the spec from what we’ve 
>>>> learned.
>>>> >>>>>>>>
>>>> >>>>>>>> At this point, I’m confident that the spec is reasonable and
>>>> correct. Thank you to everyone working on these reference implementations!
>>>> >>>>>>>>
>>>> >>>>>>>> The next step is to discuss any outstanding items or concerns
>>>> about moving forward, and then to have a vote thread to adopt the spec.
>>>> I’ll start off with a couple of items:
>>>> >>>>>>>>
>>>> >>>>>>>> One potential concern is that the upstream Variant spec hasn’t
>>>> yet been finalized by the Parquet community, but we’ve built a full,
>>>> independent implementation in Iceberg to validate the spec. I think the
>>>> Parquet community is primarily waiting on getting the PRs in to have a Java
>>>> reference implementation, so the risk of changes to the Variant spec is
>>>> small.
>>>> >>>>>>>>
>>>> >>>>>>>> There’s also an on-going vote to add encryption keys in
>>>> support of full table encryption that I think we want to get in.
>>>> >>>>>>>>
>>>> >>>>>>>> Any other items we may want to clear up?
>>>> >>>>>>>>
>>>> >>>>>>>> Ryan
>>>>
>>>
