Re: [DISCUSS] Finalizing the v3 spec

Russell Spitzer Fri, 02 May 2025 09:37:38 -0700

Sounds good to me; I think we can move ahead with this. For all intents and
purposes, I think we are past any breaking changes for Spec V3 and should
consider it "stable" for implementation purposes. I want to work on some
official descriptions of our spec versioning / library process to better
explain this to outside users, but that can happen orthogonally.


On Thu, May 1, 2025 at 12:05 PM Ryan Blue <rdb...@gmail.com> wrote:

> Thanks, everyone! Looks like there are a few points to discuss.
>
> [JB] Maybe a release with the core updated before announcing spec v3
> officially would be a good idea ?
> [Manu] Agree with Russell and JB that we make a “RC” release for V3 spec
> to test implementations, compatibility, etc before finalizing it.
>
> As Fokko noted, we are currently concerned about the spec and not
> implementations. The reason is that implementation work before the spec is
> finalized is to reduce risk and build confidence that the spec is complete
> and correct. Once that’s done, it is important to finalize the changes. If
> we don’t finalize the changes, then implementations don’t know how/what to
> build and cannot plan when they will fully support v3, because it could
> change. Most of the work in other implementations will take place after the
> spec is adopted.
>
> Our process for building confidence in new spec versions is to update the
> spec with pending changes, implement them to validate (and clarify or
> adjust as needed), and vote to adopt the new version as a confirmation that
> we agree that the spec changes are reasonable and correct.
>
> We’ve already voted to accept the pending v3 changes into the spec, so the
> changes have already been in a candidate state for quite some time to work
> on implementations. Now we’re at the point where we’ve implemented the
> features and, in my opinion, have demonstrated the spec changes are correct
> and complete.
>
> To that end, the question I’m raising in this thread is *“what areas and
> features need further validation?”*
>
> I appreciate the ideas here — releasing will assist other implementations
> — but I don’t think that changes the question for this thread. The aim is
> to identify specific risks and blockers that we need to tackle before
> adopting the changes.
>
> [Russell] We should probably come to a resolution on the compressed
> metadata.json name as well, although that’s mostly retroactive. V3 would be
> the place where we could officially change the naming convention.
>
> I don’t think that this affects v3, but we should agree before moving on.
> The only part of the spec that would depend on this is the paths used by
> file system tables and that strategy is deprecated. We should only document
> for clarity (we can’t change it) and I think we can do that any time.
>
> For the conventions used in catalog tables, I don’t think that we want to
> have requirements in the spec for file naming. We’ve avoided that in the
> past and it isn’t needed. It’s nice to have a convention in implementation
> notes, but there are other ways to handle this like magic bytes and catalog
> tracking.
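
As a rough illustration of the magic-bytes alternative mentioned above, a reader can detect gzip-compressed metadata from the file content rather than from its name. This is a minimal sketch, not Iceberg library code: the helper name `read_metadata_json` is hypothetical, and it assumes gzip is the only compression in use.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def read_metadata_json(raw: bytes) -> dict:
    """Parse table metadata bytes, decompressing only if they look like gzip.

    Hypothetical helper: the file name is never consulted.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# The same document, stored uncompressed and gzip-compressed.
plain = json.dumps({"format-version": 3}).encode("utf-8")
compressed = gzip.compress(plain)
```

Both `read_metadata_json(plain)` and `read_metadata_json(compressed)` return the same dictionary. A JSON document cannot begin with the byte 0x1f, so the check is unambiguous for this case.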
>
> [Gang] it is implicit and obvious that only bucket transform can apply to
> multi-arg transform, it is still unclear the order of source columns and
> algorithm to use to calculate the bucket value
>
> I think there is some confusion here, but Fokko may have already cleared
> it up.
>
> Right now, there are no multi-argument transforms in the spec. We have
> discussed adding a multi-argument bucket function, but there is not
> currently one in the spec. In order to minimize changes required for v3, we
> opted to update the spec to allow adding new transforms in a
> forward-compatible way between major spec versions (implementations must
> ignore unknown transforms).
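
To make the forward-compatibility rule concrete, here is a small sketch of a reader that tolerates transform names it does not recognize. The names `KNOWN_TRANSFORMS`, `parse_transform`, and `usable_for_pruning` are illustrative assumptions, not the Iceberg API, and the set of known transforms is deliberately incomplete. Treating "ignore" as "keep the field but do not use it for pruning" is also an assumption of this sketch.

```python
import re

# Parameterless transforms this sketch recognizes (assumed subset, not the full spec list).
KNOWN_TRANSFORMS = {"identity", "year", "month", "day", "hour", "void"}
# Parameterized transforms such as bucket[16] or truncate[4].
PARAMETERIZED = re.compile(r"^(bucket|truncate)\[(\d+)\]$")


def parse_transform(name: str) -> dict:
    """Describe a transform string instead of failing on unknown names."""
    if name in KNOWN_TRANSFORMS:
        return {"name": name, "known": True}
    match = PARAMETERIZED.match(name)
    if match:
        return {"name": match.group(1), "param": int(match.group(2)), "known": True}
    # Unknown transform: keep the raw string and mark it as not understood
    # (assumption of this sketch) rather than raising an error.
    return {"name": name, "known": False}


def usable_for_pruning(fields: list[dict]) -> list[dict]:
    """Keep only partition fields whose transform this reader understands."""
    return [f for f in fields if parse_transform(f["transform"])["known"]]
```

For example, given fields with transforms `day` and a made-up `future_fn[4]`, only the `day` field is returned by `usable_for_pruning`, and parsing does not raise.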
>
> [Jia] We’re currently addressing the handling of null/NaN values for X, Y,
> Z, and M coordinates in the Parquet format repository
>
> I agree that this is a good thing to clarify. We currently state that the
> ranges are [-180, 180] and [-90, 90] for geography, but we should state how
> points with NaN values are handled.
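
One reason the NaN case needs explicit wording is that ordinary floating-point comparisons silently mishandle it. The sketch below uses plain Python floats and nothing from Iceberg or Parquet; `in_longitude_range` and `naive_bounds` are hypothetical helpers.

```python
import math

NAN = float("nan")


def in_longitude_range(x: float) -> bool:
    """Range check from the stated bounds; any comparison with NaN is False."""
    return -180.0 <= x <= 180.0


def naive_bounds(values: list[float]) -> tuple[float, float]:
    """Min/max without NaN handling; the result depends on where NaN appears."""
    return min(values), max(values)


# Same coordinates, different order.
nan_first = naive_bounds([NAN, 10.0, 20.0])
nan_last = naive_bounds([10.0, 20.0, NAN])
```

With NaN first, both bounds come out as NaN; with NaN last, it is dropped and the bounds are (10.0, 20.0). Either outcome is accidental, which is why the expected behavior should be written down.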
>
> On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com>
> wrote:
>
>> Hi Jia
>>
>> I feel it would be nice to get that Parquet spec clarification
>> https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec
>> as well, once we finalize that.
>>
>> Thanks
>> Szehon
>>
>> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>>
>>> Hi Szehon,
>>>
>>> Thanks for clarifying it.
>>>
>>> We’re currently addressing the handling of null/NaN values for X, Y, Z,
>>> and M coordinates in the Parquet format repository. We’ve already concluded
>>> that the spec of Parquet (same on the Iceberg side I believe) only needs
>>> additional clarification to guide expected behavior:
>>> https://github.com/apache/parquet-format/pull/494
>>>
>>> BTW the Parquet Geo C++ PR has been merged today:
>>> https://github.com/apache/arrow/pull/45459  I believe the Parquet Geo
>>> Java PR is also very close.
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org>
>>> wrote:
>>>
>>>> Hey Ryan,
>>>>
>>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>>>
>>>>> The v3 spec for multi-arg transform only advises to use `source-ids`
>>>>> instead of `source-id`. Although it is implicit and obvious that only
>>>>> bucket transform can apply to multi-arg transform, it is still unclear the
>>>>> order of source columns and algorithm to use to calculate the bucket value.
>>>>>
>>>>
>>>> V3 now uses `source-ids` when there are multiple arguments and `source-id`
>>>> when there is just one. The PR can be found here
>>>> <https://github.com/apache/iceberg/pull/12644>. This makes the
>>>> serialization deterministic without knowing the format-version, simplifying
>>>> the readers/writers. After some discussion on the PR, we've decided to
>>>> leave out the multi-arg bucket transform so the V3 spec can be finalized.
>>>> So V3 only contains the scaffolding for multi-arg transforms.
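
The serialization rule described above can be sketched as follows. This is an illustration, not the code from the PR: `partition_field_to_json` is a hypothetical helper, and the field layout is simplified to the keys relevant here.

```python
def partition_field_to_json(name: str, transform: str, field_id: int,
                            source_ids: list[int]) -> dict:
    """Serialize a partition field, choosing the key by argument count only."""
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")
    field = {"name": name, "transform": transform, "field-id": field_id}
    if len(source_ids) == 1:
        # Single-argument transform: write the scalar key.
        field["source-id"] = source_ids[0]
    else:
        # Multi-argument transform: write the list key.
        field["source-ids"] = list(source_ids)
    return field
```

The function never looks at the format version, which is the point of the change: the number of source columns alone decides which key is written. Since no multi-argument transform is in the spec, any multi-column input here is hypothetical.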
>>>>
>>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds
>>>>> and geospatial predicate to be merged:
>>>>> https://github.com/apache/iceberg/pull/12667
>>>>
>>>>
>>>> I think it is a good idea to distinguish between the spec and the
>>>> actual code. If we all feel comfortable with the spec, I think we could
>>>> finalize it. Being comfortable also means that we know that we have a
>>>> working implementation, but I don't think we have to wrap up all the loose
>>>> ends before voting on the spec.
>>>>
>>>> On the PyIceberg side, we're also working to catch up on the V3
>>>> capabilities <https://github.com/apache/iceberg-python/issues/1818>.
>>>> Having a Java release that exposes these capabilities helps, so we can do
>>>> round-trip validation.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>>
>>>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds
>>>>> and geospatial predicate to be merged:
>>>>> https://github.com/apache/iceberg/pull/12667
>>>>>
>>>>> Should a release with core updates include this PR?
>>>>>
>>>>> Thanks,
>>>>> Jia
>>>>>
>>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to
>>>>>> test implementations, compatibility, etc before finalizing it.
>>>>>>
>>>>>> Thanks,
>>>>>> Manu
>>>>>>
>>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <
>>>>>> j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> Hi Ryan
>>>>>>>
>>>>>>> It sounds good.
>>>>>>>
>>>>>>> About multi-args transforms, with the clarification we did a couple
>>>>>>> of weeks ago, I think we are good.
>>>>>>> Maybe a release with the core updated before announcing spec v3
>>>>>>> officially would be a good idea ?
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I think we’ve reached the point where it’s time to finalize and
>>>>>>>> adopt the changes for Iceberg v3. We’ve been working toward this for the
>>>>>>>> last few months and have now implemented the v3 features in the Java
>>>>>>>> library to reduce the risk of needing changes or hitting problems (row
>>>>>>>> lineage support in Spark 3.5 just went in!). We’ve also incorporated some
>>>>>>>> clarifications and minor changes back into the spec from what we’ve learned.
>>>>>>>>
>>>>>>>> At this point, I’m confident that the spec is reasonable and
>>>>>>>> correct. Thank you to everyone working on these reference implementations!
>>>>>>>>
>>>>>>>> The next step is to discuss any outstanding items or concerns about
>>>>>>>> moving forward, and then to have a vote thread to adopt the spec. I’ll
>>>>>>>> start off with a couple of items:
>>>>>>>>
>>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet
>>>>>>>> been finalized by the Parquet community, but we’ve built a full,
>>>>>>>> independent implementation in Iceberg to validate the spec. I think the
>>>>>>>> Parquet community is primarily waiting on getting the PRs in to have a Java
>>>>>>>> reference implementation, so the risk of changes to the Variant spec is
>>>>>>>> small.
>>>>>>>>
>>>>>>>> There’s also an ongoing vote to add encryption keys in support of
>>>>>>>> full table encryption that I think we want to get in.
>>>>>>>>
>>>>>>>> Any other items we may want to clear up?
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>
