Re: [DISCUSS] Finalizing the v3 spec

Manu Zhang Tue, 06 May 2025 03:32:35 -0700

I'm wondering what changes we are voting for here. Is it everything related
to
https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities from
the table spec?
How about changes to other specs?


Do we summarize all the changes in
https://iceberg.apache.org/spec/#appendix-e-format-version-changes? It
looks like row lineage is missing here.

Thanks,
Manu

On Tue, May 6, 2025 at 12:09 PM Anton Okolnychyi <[email protected]>
wrote:

> DVs in Spark seem to behave reasonably, serving as a reference
> implementation of the V3 spec. There are areas for optimization/refinement
> but nothing was observed that requires changing the spec. I would also like
> to add the notion of content overhead/metadata (for Puffin/Parquet footers)
> to manifests to optimize DVs maintenance. That said, it is optional
> information and can be added after finalizing V3.
>
> - Anton
>
> On Fri, May 2, 2025 at 23:23 Jean-Baptiste Onofré <[email protected]> wrote:
>
>> Hi Ryan
>>
>> All good for the spec. The idea for release is just a help to "double
>> check" the spec is good (we already saw some slight changes to the
>> spec while working on the release). I think we can be "confident" that we
>> won't have unexpected change.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <[email protected]> wrote:
>> >
>> > Thanks, everyone! Looks like there are a few points to discuss.
>> >
>> > [JB] Maybe a release with the core updated before announcing spec v3
>> officially would be a good idea ?
>> > [Manu] Agree with Russell and JB that we make a “RC” release for V3
>> spec to test implementations, compatibility, etc before finalizing it.
>> >
>> > As Fokko noted, we are currently concerned about the spec and not
>> implementations. The purpose of implementation work before the spec is
>> finalized is to reduce risk and build confidence that the spec is complete
>> and correct. Once that’s done, it is important to finalize the changes. If
>> we don’t finalize the changes, then implementations don’t know how/what to
>> build and cannot plan when they will fully support v3 — because it could
>> change. Most of the work in other implementations will take place after the
>> spec is adopted.
>> >
>> > Our process for building confidence in new spec versions is to update
>> the spec with pending changes, implement them to validate (and clarify or
>> adjust as needed), and vote to adopt the new version as a confirmation that
>> we agree that the spec changes are reasonable and correct.
>> >
>> > We’ve already voted to accept the pending v3 changes into the spec, so
>> the changes have already been in a candidate state for quite some time to
>> work on implementations. Now we’re at the point where we’ve implemented the
>> features and, in my opinion, have demonstrated the spec changes are correct
>> and complete.
>> >
>> > To that end, the question I’m raising in this thread is “what areas and
>> features need further validation?”
>> >
>> > I appreciate the ideas here — releasing will assist other
>> implementations — but I don’t think that changes the question for this
>> thread. The aim is to identify specific risks and blockers that we need to
>> tackle before adopting the changes.
>> >
>> > [Russell] We should probably come to a resolution on the compressed
>> metadata.json name as well, although that’s mostly retroactive. V3 would be
>> the place where we could officially change the naming convention.
>> >
>> > I don’t think that this affects v3, but we should agree before moving
>> on. The only part of the spec that would depend on this is the paths used
>> by file system tables and that strategy is deprecated. We should only
>> document for clarity (we can’t change it) and I think we can do that any
>> time.
>> >
>> > For the conventions used in catalog tables, I don’t think that we want
>> to have requirements in the spec for file naming. We’ve avoided that in the
>> past and it isn’t needed. It’s nice to have a convention in implementation
>> notes, but there are other ways to handle this like magic bytes and catalog
>> tracking.
>> >
>> > [Gang] it is implicit and obvious that only bucket transform can apply
>> to multi-arg transform, it is still unclear the order of source columns and
>> algorithm to use to calculate the bucket value
>> >
>> > I think there is some confusion here, but Fokko may have already
>> cleared it up.
>> >
>> > Right now, there are no multi-argument transforms in the spec. We have
>> discussed adding a multi-argument bucket function, but there is not
>> currently one in the spec. In order to minimize changes required for v3, we
>> opted to update the spec to allow adding new transforms in a
>> forward-compatible way between major spec versions (implementations must
>> ignore unknown transforms).
>> >
>> > [Jia] We’re currently addressing the handling of null/NaN values for X,
>> Y, Z, and M coordinates in the Parquet format repository
>> >
>> > I agree that this is a good thing to clarify. We currently state that
>> the ranges are [-180, 180] and [-90, 90] for geography, but we should state
>> how points with NaN values are handled.
>> >
>> >
>> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <[email protected]>
>> wrote:
>> >>
>> >> Hi Jia
>> >>
>> >> I feel it would be nice to get that Parquet spec clarification
>> https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec
>> as well, once we finalize that.
>> >>
>> >> Thanks
>> >> Szehon
>> >>
>> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <[email protected]> wrote:
>> >>>
>> >>> Hi Szehon,
>> >>>
>> >>> Thanks for clarifying it.
>> >>>
>> >>> We’re currently addressing the handling of null/NaN values for X, Y,
>> Z, and M coordinates in the Parquet format repository. We’ve already
>> concluded that the spec of Parquet (same on the Iceberg side I believe)
>> only needs additional clarification to guide expected behavior:
>> https://github.com/apache/parquet-format/pull/494
>> >>>
>> >>> BTW the Parquet Geo C++ PR has been merged today:
>> https://github.com/apache/arrow/pull/45459  I believe the Parquet Geo
>> Java PR is also very close.
>> >>>
>> >>> Thanks,
>> >>> Jia
>> >>>
>> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <[email protected]>
>> wrote:
>> >>>>
>> >>>> Hey Ryan,
>> >>>>
>> >>>> Thanks for raising this, and I'm very excited to see V3 being
>> finalized!
>> >>>>
>> >>>>> The v3 spec for multi-arg transform only advises to use
>> `source-ids` instead of `source-id`. Although it is implicit and obvious
>> that only bucket transform can apply to multi-arg transform, it is still
>> unclear the order of source columns and algorithm to use to calculate the
>> bucket value.
>> >>>>
>> >>>>
>> >>>> V3 now uses `source-ids` when there are multiple arguments and
>> `source-id` when there is just one. PR can be found here. This makes the
>> serialization deterministic without knowing the format-version, simplifying
>> the readers/writers. After some discussion on the PR, we've decided to
>> leave out the multi-arg bucket transform so the V3 spec can be finalized.
>> So V3 only contains the scaffolding for multi-arg transforms.
>> >>>>
>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>> bounds and geospatial predicate to be merged:
>> https://github.com/apache/iceberg/pull/12667
>> >>>>
>> >>>>
>> >>>> I think it is a good idea to distinguish between the spec and the
>> actual code. If we all feel comfortable with the spec, I think we could
>> finalize it. Being comfortable also means that we know that we have a
>> working implementation, but I don't think we have to wrap up all the loose
>> ends before voting on the spec.
>> >>>>
>> >>>> On the PyIceberg side, we're also working to catch up on the V3
>> capabilities. Having a Java release that exposes these capabilities helps,
>> so we can do round-trip validation.
>> >>>>
>> >>>> Kind regards,
>> >>>> Fokko
>> >>>>
>> >>>>
>> >>>> On Wed, Apr 30, 2025 at 07:26 Jia Yu <[email protected]> wrote:
>> >>>>>
>> >>>>> Hi folks,
>> >>>>>
>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>> bounds and geospatial predicate to be merged:
>> https://github.com/apache/iceberg/pull/12667
>> >>>>>
>> >>>>> Should a release with core updates include this PR?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Jia
>> >>>>>
>> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <
>> [email protected]> wrote:
>> >>>>>>
>> >>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec
>> to test implementations, compatibility, etc before finalizing it.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Manu
>> >>>>>>
>> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <
>> [email protected]> wrote:
>> >>>>>>>
>> >>>>>>> Hi Ryan
>> >>>>>>>
>> >>>>>>> It sounds good.
>> >>>>>>>
>> >>>>>>> About multi-args transforms, with the clarification we did a
>> couple of weeks ago, I think we are good.
>> >>>>>>> Maybe a release with the core updated before announcing spec v3
>> officially would be a good idea ?
>> >>>>>>>
>> >>>>>>> Regards
>> >>>>>>> JB
>> >>>>>>>
>> >>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <[email protected]>
>> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi everyone,
>> >>>>>>>>
>> >>>>>>>> I think we’ve reached the point where it’s time to finalize and
>> adopt the changes for Iceberg v3. We’ve been working toward this for the
>> last few months and have now implemented the v3 features in the Java
>> library to reduce the risk of needing changes or hitting problems (row
>> lineage support in Spark 3.5 just went in!). We’ve also incorporated some
>> clarifications and minor changes back into the spec from what we’ve learned.
>> >>>>>>>>
>> >>>>>>>> At this point, I’m confident that the spec is reasonable and
>> correct. Thank you to everyone working on these reference implementations!
>> >>>>>>>>
>> >>>>>>>> The next step is to discuss any outstanding items or concerns
>> about moving forward, and then to have a vote thread to adopt the spec.
>> I’ll start off with a couple of items:
>> >>>>>>>>
>> >>>>>>>> One potential concern is that the upstream Variant spec hasn’t
>> yet been finalized by the Parquet community, but we’ve built a full,
>> independent implementation in Iceberg to validate the spec. I think the
>> Parquet community is primarily waiting on getting the PRs in to have a Java
>> reference implementation, so the risk of changes to the Variant spec is
>> small.
>> >>>>>>>>
>> >>>>>>>> There’s also an on-going vote to add encryption keys in support
>> of full table encryption that I think we want to get in.
>> >>>>>>>>
>> >>>>>>>> Any other items we may want to clear up?
>> >>>>>>>>
>> >>>>>>>> Ryan
>>
>
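
For the forward-compatibility rule in Ryan's reply quoted above
(implementations must ignore unknown transforms), here is a minimal Python
sketch of what a reader might do when it loads a partition spec. It is an
illustration under assumptions, not the reference implementation: the
function name split_known_fields is hypothetical, "zorder[2]" is a made-up
transform standing in for some future one, and what a real reader does with
a skipped field (for example during scan planning) is not covered.

```python
import re

# Transform names currently in the table spec; bucket and truncate carry
# a numeric parameter in square brackets.
KNOWN_TRANSFORM = re.compile(
    r"identity|year|month|day|hour|void|bucket\[\d+\]|truncate\[\d+\]"
)


def split_known_fields(partition_fields):
    """Separate partition fields this reader understands from the rest.

    Hypothetical helper: unknown transforms are set aside instead of
    raising, so a newer writer's spec does not break an older reader.
    """
    known, unknown = [], []
    for field in partition_fields:
        if KNOWN_TRANSFORM.fullmatch(field["transform"]):
            known.append(field)
        else:
            unknown.append(field)
    return known, unknown


fields = [
    {"name": "ts_day", "transform": "day", "source-id": 1, "field-id": 1000},
    {"name": "id_bucket", "transform": "bucket[16]", "source-id": 2, "field-id": 1001},
    # "zorder[2]" is invented for this example; it is not in the spec.
    {"name": "future", "transform": "zorder[2]", "source-ids": [1, 2], "field-id": 1002},
]
known, unknown = split_known_fields(fields)
```

The design choice shown is only that an unrecognized transform name is
treated as data to skip and not as an error.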

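Fokko's point quoted above, that choosing between `source-id` and
`source-ids` by argument count makes serialization deterministic without
knowing the format-version, can be sketched the same way. This is a hedged
illustration: both function names are hypothetical, and "example_multi" is a
placeholder transform name, since the thread notes that V3 contains only the
scaffolding and no actual multi-arg transform.

```python
def partition_field_to_json(name, transform, field_id, source_ids):
    """Serialize one partition field as a dict (hypothetical helper).

    A single source column is written as "source-id"; several are written
    as "source-ids". The choice depends only on the argument count.
    """
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")
    out = {"name": name, "transform": transform, "field-id": field_id}
    if len(source_ids) == 1:
        out["source-id"] = source_ids[0]
    else:
        out["source-ids"] = list(source_ids)
    return out


def source_ids_from_json(field):
    """Read the source columns back, whichever key was written."""
    if "source-ids" in field:
        return list(field["source-ids"])
    return [field["source-id"]]


single = partition_field_to_json("id_bucket", "bucket[16]", 1001, [2])
# "example_multi" is a placeholder; no multi-arg transform exists in the spec.
multi = partition_field_to_json("combo", "example_multi", 1002, [1, 2])
```

Because the reader accepts either key, it can round-trip both shapes without
consulting the table's format-version.
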