Re: [DISCUSS] Finalizing the v3 spec

Ryan Blue Tue, 06 May 2025 08:24:56 -0700

Manu,

We aren't currently voting. We are discussing any outstanding items to
address before we close v3 to further changes and adopt the existing v3
changes. Right now, the open item is to clarify NaN behavior in geometry
and geography, PR #12956 <https://github.com/apache/iceberg/pull/12956>.


Thanks for noting that the row lineage changes should be added to the
appendix, I'll open a PR to add it. That appendix is an area to highlight
things that have changed across versions, but an omission does not alter
the requirements elsewhere in the spec. The changes we are discussing are the
things that are noted as part of v3 in the spec. The major additions are
new types, DVs, and row lineage.

Ryan

On Tue, May 6, 2025 at 3:32 AM Manu Zhang <[email protected]> wrote:

> I'm wondering what changes we are voting for here. Is it everything
> related to
> https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities 
> from
> the table spec?
> How about changes to other specs?
>
> Do we summarize all the changes in
> https://iceberg.apache.org/spec/#appendix-e-format-version-changes? It
> looks like row lineage is missing here.
>
> Thanks,
> Manu
>
> On Tue, May 6, 2025 at 12:09 PM Anton Okolnychyi <[email protected]>
> wrote:
>
>> DVs in Spark seem to behave reasonably, serving as a reference
>> implementation of the V3 spec. There are areas for optimization/refinement
>> but nothing was observed that requires changing the spec. I would also like
>> to add the notion of content overhead/metadata (for Puffin/Parquet footers)
>> to manifests to optimize DVs maintenance. That said, it is optional
>> information and can be added after finalizing V3.
>>
>> - Anton
>>
>> On Fri, May 2, 2025 at 11:23 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>
>>> Hi Ryan
>>>
>>> All good for the spec. The idea for release is just a help to "double
>>> check" the spec is good (we already saw some slight changes to the
>>> spec while working on release). I think we can be "confident" that we
>>> won't have unexpected change.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <[email protected]> wrote:
>>> >
>>> > Thanks, everyone! Looks like there are a few points to discuss.
>>> >
>>> > [JB] Maybe a release with the core updated before announcing spec v3
>>> officially would be a good idea ?
>>> > [Manu] Agree with Russell and JB that we make a “RC” release for V3
>>> spec to test implementations, compatibility, etc before finalizing it.
>>> >
>>> > As Fokko noted, we are currently concerned about the spec and not
>>> implementations. The reason is that implementation work before the spec is
>>> finalized is to reduce risk and build confidence that the spec is complete
>>> and correct. Once that’s done, it is important to finalize the changes. If
>>> we don’t finalize the changes, then implementations don’t know how/what to
>>> build and cannot plan when they will fully support v3 — because it could
>>> change. Most of the work in other implementations will take place after the
>>> spec is adopted.
>>> >
>>> > Our process for building confidence in new spec versions is to update
>>> the spec with pending changes, implement them to validate (and clarify or
>>> adjust as needed), and vote to adopt the new version as a confirmation that
>>> we agree that the spec changes are reasonable and correct.
>>> >
>>> > We’ve already voted to accept the pending v3 changes into the spec, so
>>> the changes have already been in a candidate state for quite some time to
>>> work on implementations. Now we’re at the point where we’ve implemented the
>>> features and, in my opinion, have demonstrated the spec changes are correct
>>> and complete.
>>> >
>>> > To that end, the question I’m raising in this thread is “what areas
>>> and features need further validation?”
>>> >
>>> > I appreciate the ideas here — releasing will assist other
>>> implementations — but I don’t think that changes the question for this
>>> thread. The aim is to identify specific risks and blockers that we need to
>>> tackle before adopting the changes.
>>> >
>>> > [Russell] We should probably come to a resolution on the compressed
>>> metadata.json name as well, although that’s mostly retroactive. V3 would be
>>> the place where we could officially change the naming convention.
>>> >
>>> > I don’t think that this affects v3, but we should agree before moving
>>> on. The only part of the spec that would depend on this is the paths used
>>> by file system tables and that strategy is deprecated. We should only
>>> document for clarity (we can’t change it) and I think we can do that any
>>> time.
>>> >
>>> > For the conventions used in catalog tables, I don’t think that we want
>>> to have requirements in the spec for file naming. We’ve avoided that in the
>>> past and it isn’t needed. It’s nice to have a convention in implementation
>>> notes, but there are other ways to handle this like magic bytes and catalog
>>> tracking.
>>> >
>>> > [Gang] it is implicit and obvious that only bucket transform can apply
>>> to multi-arg transform, it is still unclear the order of source columns and
>>> algorithm to use to calculate the bucket value
>>> >
>>> > I think there is some confusion here, but Fokko may have already
>>> cleared it up.
>>> >
>>> > Right now, there are no multi-argument transforms in the spec. We have
>>> discussed adding a multi-argument bucket function, but there is not
>>> currently one in the spec. In order to minimize changes required for v3, we
>>> opted to update the spec to allow adding new transforms in a
>>> forward-compatible way between major spec versions (implementations must
>>> ignore unknown transforms).
>>> >
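A minimal sketch of what "ignore unknown transforms" could look like on the reader side. The transform names are the ones in the table spec; the function names and the choice to simply drop unrecognized partition fields from planning are assumptions for illustration, not the behavior of any particular implementation.

```python
import re
from typing import Optional, Tuple

# Transforms without a parameter, as named in the table spec.
KNOWN_SIMPLE = {"identity", "year", "month", "day", "hour", "void"}
# Parameterized transforms such as bucket[16] or truncate[4].
PARAMETERIZED = re.compile(r"^(bucket|truncate)\[(\d+)\]$")


def parse_transform(text: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (name, parameter) for a known transform, or None if unknown."""
    if text in KNOWN_SIMPLE:
        return (text, None)
    match = PARAMETERIZED.match(text)
    if match:
        return (match.group(1), int(match.group(2)))
    return None  # unknown transform: the caller skips it instead of failing


def usable_fields(partition_fields):
    """Keep only the partition fields whose transform this reader understands.

    Hypothetical helper: fields with unknown transforms are left out of
    planning (for example, filter pushdown) rather than raising an error.
    """
    return [f for f in partition_fields if parse_transform(f["transform"]) is not None]
```

With this shape, a reader that meets a transform added by a later spec revision still opens the table; it just cannot use that field for pruning.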
>>> > [Jia] We’re currently addressing the handling of null/NaN values for
>>> X, Y, Z, and M coordinates in the Parquet format repository
>>> >
>>> > I agree that this is a good thing to clarify. We currently state that
>>> the ranges are [-180, 180] and [-90, 90] for geography, but we should state
>>> how points with NaN values are handled.
>>> >
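To illustrate why NaN handling needs an explicit rule, the sketch below shows that a naive min/max over coordinates gives order-dependent bounds once a NaN is present. The second function, which skips NaN values, is only one possible policy and is an assumption here; the thread does not say which rule the spec will adopt.

```python
import math


def naive_bounds(values):
    """Fold min/max over the values with no special NaN handling."""
    lo = hi = values[0]
    for v in values[1:]:
        lo = min(lo, v)
        hi = max(hi, v)
    return lo, hi


def bounds_skipping_nan(values):
    """One possible policy (an assumption): ignore NaN coordinates.

    Returns None when every value is NaN, so no bound is produced.
    """
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        return None
    return min(finite), max(finite)


nan = float("nan")
# Same coordinates, different order: the naive result changes.
first = naive_bounds([nan, 10.0, -20.0])   # NaN leads, so it sticks in both bounds
second = naive_bounds([10.0, nan, -20.0])  # NaN is silently dropped
```

Because comparisons with NaN are always false, the naive fold depends on where the NaN appears, which is why readers and writers need a stated rule to agree on bounds.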
>>> >
>>> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <[email protected]>
>>> wrote:
>>> >>
>>> >> Hi Jia
>>> >>
>>> >> I feel it would be nice to get that Parquet spec clarification
>>> https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec
>>> as well, once we finalize that.
>>> >>
>>> >> Thanks
>>> >> Szehon
>>> >>
>>> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <[email protected]> wrote:
>>> >>>
>>> >>> Hi Szehon,
>>> >>>
>>> >>> Thanks for clarifying it.
>>> >>>
>>> >>> We’re currently addressing the handling of null/NaN values for X, Y,
>>> Z, and M coordinates in the Parquet format repository. We’ve already
>>> concluded that the spec of Parquet (same on the Iceberg side I believe)
>>> only needs additional clarification to guide expected behavior:
>>> https://github.com/apache/parquet-format/pull/494
>>> >>>
>>> >>> BTW the Parquet Geo C++ PR has been merged today:
>>> https://github.com/apache/arrow/pull/45459  I believe the Parquet Geo
>>> Java PR is also very close.
>>> >>>
>>> >>> Thanks,
>>> >>> Jia
>>> >>>
>>> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <[email protected]>
>>> wrote:
>>> >>>>
>>> >>>> Hey Ryan,
>>> >>>>
>>> >>>> Thanks for raising this, and I'm very excited to see V3 being
>>> finalized!
>>> >>>>
>>> >>>>> The v3 spec for multi-arg transform only advises to use
>>> `source-ids` instead of `source-id`. Although it is implicit and obvious
>>> that only bucket transform can apply to multi-arg transform, it is still
>>> unclear the order of source columns and algorithm to use to calculate the
>>> bucket value.
>>> >>>>
>>> >>>>
>>> >>>> V3 now uses `source-ids` when there are multiple arguments and
>>> `source-id` when there is just one. PR can be found here. This makes the
>>> serialization deterministic without knowing the format-version, simplifying
>>> the readers/writers. After some discussion on the PR, we've decided to
>>> leave out the multi-arg bucket transform so the V3 spec can be finalized.
>>> So V3 only contains the scaffolding for multi-arg transforms.
>>> >>>>
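The serialization rule described above can be sketched as follows. The JSON keys (`source-id`, `source-ids`, `field-id`, `name`, `transform`) come from the spec's partition field representation; the function names are hypothetical and this is not the code from the PR.

```python
def partition_field_to_json(name, field_id, transform, source_ids):
    """Serialize a partition field without consulting the format version.

    A single source column is written as `source-id`; several source
    columns are written as `source-ids`.
    """
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")
    out = {"name": name, "field-id": field_id, "transform": transform}
    if len(source_ids) == 1:
        out["source-id"] = source_ids[0]
    else:
        out["source-ids"] = list(source_ids)
    return out


def source_ids_from_json(field):
    """Read the source column IDs back, accepting either key."""
    if "source-ids" in field:
        return list(field["source-ids"])
    return [field["source-id"]]
```

Since the choice of key depends only on the number of source columns, the same writer output is valid for v1, v2, and v3 metadata when there is a single argument.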
>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>> bounds and geospatial predicate to be merged:
>>> https://github.com/apache/iceberg/pull/12667
>>> >>>>
>>> >>>>
>>> >>>> I think it is a good idea to distinguish between the spec and the
>>> actual code. If we all feel comfortable with the spec, I think we could
>>> finalize it. Being comfortable also means that we know that we have a
>>> working implementation, but I don't think we have to wrap up all the loose
>>> ends before voting on the spec.
>>> >>>>
>>> >>>> At the PyIceberg side, we're also working to catch up on the V3
>>> capabilities. Having a Java release that exposes these capabilities helps,
>>> so we can do round-trip validation.
>>> >>>>
>>> >>>> Kind regards,
>>> >>>> Fokko
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Apr 30, 2025 at 7:26 AM Jia Yu <[email protected]> wrote:
>>> >>>>>
>>> >>>>> Hi folks,
>>> >>>>>
>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>> bounds and geospatial predicate to be merged:
>>> https://github.com/apache/iceberg/pull/12667
>>> >>>>>
>>> >>>>> Should a release with core updates include this PR?
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Jia
>>> >>>>>
>>> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <
>>> [email protected]> wrote:
>>> >>>>>>
>>> >>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec
>>> to test implementations, compatibility, etc before finalizing it.
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> Manu
>>> >>>>>>
>>> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <
>>> [email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi Ryan
>>> >>>>>>>
>>> >>>>>>> It sounds good.
>>> >>>>>>>
>>> >>>>>>> About multi-args transforms, with the clarification we did a
>>> couple of weeks ago, I think we are good.
>>> >>>>>>> Maybe a release with the core updated before announcing spec v3
>>> officially would be a good idea ?
>>> >>>>>>>
>>> >>>>>>> Regards
>>> >>>>>>> JB
>>> >>>>>>>
>>> >>>>>>> On Wed, Apr 30, 2025 at 12:35 AM Ryan Blue <[email protected]> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Hi everyone,
>>> >>>>>>>>
>>> >>>>>>>> I think we’ve reached the point where it’s time to finalize and
>>> adopt the changes for Iceberg v3. We’ve been working toward this for the
>>> last few months and have now implemented the v3 features in the Java
>>> library to reduce the risk of needing changes or hitting problems (row
>>> lineage support in Spark 3.5 just went in!). We’ve also incorporated some
>>> clarifications and minor changes back into the spec from what we’ve learned.
>>> >>>>>>>>
>>> >>>>>>>> At this point, I’m confident that the spec is reasonable and
>>> correct. Thank you to everyone working on these reference implementations!
>>> >>>>>>>>
>>> >>>>>>>> The next step is to discuss any outstanding items or concerns
>>> about moving forward, and then to have a vote thread to adopt the spec.
>>> I’ll start off with a couple of items:
>>> >>>>>>>>
>>> >>>>>>>> One potential concern is that the upstream Variant spec hasn’t
>>> yet been finalized by the Parquet community, but we’ve built a full,
>>> independent implementation in Iceberg to validate the spec. I think the
>>> Parquet community is primarily waiting on getting the PRs in to have a Java
>>> reference implementation, so the risk of changes to the Variant spec is
>>> small.
>>> >>>>>>>>
>>> >>>>>>>> There’s also an on-going vote to add encryption keys in support
>>> of full table encryption that I think we want to get in.
>>> >>>>>>>>
>>> >>>>>>>> Any other items we may want to clear up?
>>> >>>>>>>>
>>> >>>>>>>> Ryan
>>>
>>
