Re: [DISCUSS] Finalizing the v3 spec

Jean-Baptiste Onofré Fri, 02 May 2025 23:23:49 -0700

Hi Ryan

All good for the spec. The idea of a release is just to help "double
check" that the spec is good (we already saw some slight changes to the
spec while working on the release). I think we can be "confident" that we
won't have unexpected changes.


Thanks!
Regards
JB

On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
>
> Thanks, everyone! Looks like there are a few points to discuss.
>
> [JB] Maybe a release with the core updated before announcing spec v3
> officially would be a good idea?
> [Manu] Agree with Russell and JB that we make an “RC” release for V3 spec to
> test implementations, compatibility, etc. before finalizing it.
>
> As Fokko noted, we are currently concerned about the spec and not
> implementations. The purpose of implementation work before the spec is
> finalized is to reduce risk and build confidence that the spec is complete
> and correct. Once that’s done, it is important to finalize the changes. If we
> don’t finalize the changes, then implementations don’t know how/what to build
> and cannot plan when they will fully support v3 — because it could change. 
> Most of the work in other implementations will take place after the spec is 
> adopted.
>
> Our process for building confidence in new spec versions is to update the 
> spec with pending changes, implement them to validate (and clarify or adjust 
> as needed), and vote to adopt the new version as a confirmation that we agree 
> that the spec changes are reasonable and correct.
>
> We’ve already voted to accept the pending v3 changes into the spec, so the 
> changes have already been in a candidate state for quite some time to work on 
> implementations. Now we’re at the point where we’ve implemented the features 
> and, in my opinion, have demonstrated the spec changes are correct and 
> complete.
>
> To that end, the question I’m raising in this thread is “what areas and 
> features need further validation?”
>
> I appreciate the ideas here — releasing will assist other implementations — 
> but I don’t think that changes the question for this thread. The aim is to 
> identify specific risks and blockers that we need to tackle before adopting 
> the changes.
>
> [Russell] We should probably come to a resolution on the compressed 
> metadata.json name as well, although that’s mostly retroactive. V3 would be 
> the place where we could officially change the naming convention.
>
> I don’t think that this affects v3, but we should agree before moving on. The 
> only part of the spec that would depend on this is the paths used by file 
> system tables and that strategy is deprecated. We should only document it for
> clarity (we can’t change it) and I think we can do that any time.
>
> For the conventions used in catalog tables, I don’t think that we want to 
> have requirements in the spec for file naming. We’ve avoided that in the past 
> and it isn’t needed. It’s nice to have a convention in implementation notes, 
> but there are other ways to handle this like magic bytes and catalog tracking.
>
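As an illustration of the "magic bytes" option mentioned above: a reader can tell a compressed metadata file from a plain one by inspecting its first bytes instead of relying on the file name. This is only a minimal sketch. It assumes the compression is gzip (gzip streams start with the bytes 0x1f 0x8b), and `load_metadata` is a hypothetical helper, not a function from any Iceberg library.

```python
import gzip
import json

# Assumption: compressed metadata files are gzip streams, which always
# begin with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def load_metadata(raw: bytes) -> dict:
    """Parse table metadata JSON, decompressing first if the bytes look like gzip.

    Hypothetical helper for illustration only.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# The same call works for both encodings, whatever the file was named.
plain = json.dumps({"format-version": 3}).encode("utf-8")
print(load_metadata(plain))
print(load_metadata(gzip.compress(plain)))
```

A plain JSON document cannot start with 0x1f, so the check does not misfire on uncompressed metadata.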
> [Gang] Although it is implicit and obvious that only the bucket transform can
> be a multi-arg transform, the order of source columns and the algorithm used
> to calculate the bucket value are still unclear.
>
> I think there is some confusion here, but Fokko may have already cleared it 
> up.
>
> Right now, there are no multi-argument transforms in the spec. We have 
> discussed adding a multi-argument bucket function, but there is not currently 
> one in the spec. In order to minimize changes required for v3, we opted to 
> update the spec to allow adding new transforms in a forward-compatible way 
> between major spec versions (implementations must ignore unknown transforms).
>
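One possible reading of "implementations must ignore unknown transforms", as a minimal sketch: a reader separates partition fields whose transform it recognizes from those it does not, and carries on instead of failing. The set of known transform names below is an assumption of this example, `split_partition_fields` is a hypothetical helper, and `zorder[2]` is an invented transform name used only to stand in for a future addition.

```python
import re

# Assumption: the transform names this hypothetical reader understands.
KNOWN_TRANSFORM = re.compile(
    r"^(identity|void|year|month|day|hour|bucket\[\d+\]|truncate\[\d+\])$"
)


def split_partition_fields(fields):
    """Return (known, ignored) partition fields based on the transform name.

    Fields with an unrecognized transform are set aside rather than
    raising, so newer metadata stays readable by an older reader.
    """
    known, ignored = [], []
    for field in fields:
        if KNOWN_TRANSFORM.match(field["transform"]):
            known.append(field)
        else:
            ignored.append(field)
    return known, ignored


fields = [
    {"name": "ts_day", "transform": "day", "source-id": 1},
    # "zorder[2]" is invented for this example; it is not in the spec.
    {"name": "xy", "transform": "zorder[2]", "source-ids": [2, 3]},
]
known, ignored = split_partition_fields(fields)
print([f["name"] for f in known], [f["name"] for f in ignored])
```

What "ignore" should mean for writes or for filter pushdown is a separate question that this sketch does not try to answer.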
> [Jia] We’re currently addressing the handling of null/NaN values for X, Y, Z, 
> and M coordinates in the Parquet format repository
>
> I agree that this is a good thing to clarify. We currently state that the 
> ranges are [-180, 180] and [-90, 90] for geography, but we should state how 
> points with NaN values are handled.
>
>
> On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>
>> Hi Jia
>>
>> I feel it would be nice to get that Parquet spec clarification
>> https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as
>> well, once we finalize that.
>>
>> Thanks
>> Szehon
>>
>> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>>>
>>> Hi Szehon,
>>>
>>> Thanks for clarifying it.
>>>
>>> We’re currently addressing the handling of null/NaN values for X, Y, Z, and 
>>> M coordinates in the Parquet format repository. We’ve already concluded 
>>> that the spec of Parquet (same on the Iceberg side I believe) only needs 
>>> additional clarification to guide expected behavior: 
>>> https://github.com/apache/parquet-format/pull/494
>>>
>>> BTW, the Parquet Geo C++ PR was merged today:
>>> https://github.com/apache/arrow/pull/45459
>>> I believe the Parquet Geo Java PR is also very close.
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>
>>>> Hey Ryan,
>>>>
>>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>>>
>>>>> The v3 spec for multi-arg transform only advises using `source-ids`
>>>>> instead of `source-id`. Although it is implicit and obvious that only
>>>>> the bucket transform can be a multi-arg transform, the order of source
>>>>> columns and the algorithm used to calculate the bucket value are still
>>>>> unclear.
>>>>
>>>>
>>>> V3 now uses `source-ids` when there are multiple arguments and `source-id`
>>>> when there is just one. PR can be found here. This makes the serialization
>>>> deterministic without knowing the format-version, simplifying the 
>>>> readers/writers. After some discussion on the PR, we've decided to leave 
>>>> out the multi-arg bucket transform so the V3 spec can be finalized. So V3 
>>>> only contains the scaffolding for multi-arg transforms.
>>>>
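A minimal sketch of the serialization rule discussed in this part of the thread, assuming it means: write `source-id` for a single source column and `source-ids` for several, so the output depends only on the field itself and not on the table's format-version. `partition_field_to_json` is a hypothetical helper, and the other key names (`name`, `transform`, `field-id`) are assumed from the partition field JSON layout.

```python
def partition_field_to_json(name, transform, field_id, source_ids):
    """Serialize one partition field to a JSON-ready dict.

    Hypothetical helper: the key is chosen from the number of source
    columns alone, so no format-version lookup is needed.
    """
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")
    out = {"name": name, "transform": transform, "field-id": field_id}
    if len(source_ids) == 1:
        out["source-id"] = source_ids[0]
    else:
        out["source-ids"] = list(source_ids)
    return out


# Single-argument transform: written with "source-id".
print(partition_field_to_json("id_bucket", "bucket[16]", 1000, [1]))

# A multi-argument field would be written with "source-ids". No such
# transform is in the spec today, so the name here is a placeholder.
print(partition_field_to_json("multi", "placeholder", 1001, [1, 2]))
```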
>>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and 
>>>>> geospatial predicate to be merged: 
>>>>> https://github.com/apache/iceberg/pull/12667
>>>>
>>>>
>>>> I think it is a good idea to distinguish between the spec and the actual 
>>>> code. If we all feel comfortable with the spec, I think we could finalize 
>>>> it. Being comfortable also means that we know that we have a working 
>>>> implementation, but I don't think we have to wrap up all the loose ends 
>>>> before voting on the spec.
>>>>
>>>> On the PyIceberg side, we're also working to catch up on the V3
>>>> capabilities. Having a Java release that exposes these capabilities helps, 
>>>> so we can do round-trip validation.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>>
>>>> On Wed, Apr 30, 2025 at 7:26 AM Jia Yu <ji...@apache.org> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and 
>>>>> geospatial predicate to be merged: 
>>>>> https://github.com/apache/iceberg/pull/12667
>>>>>
>>>>> Should a release with core updates include this PR?
>>>>>
>>>>> Thanks,
>>>>> Jia
>>>>>
>>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com> 
>>>>> wrote:
>>>>>>
>>>>>> Agree with Russell and JB that we make an "RC" release for V3 spec to
>>>>>> test implementations, compatibility, etc. before finalizing it.
>>>>>>
>>>>>> Thanks,
>>>>>> Manu
>>>>>>
>>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré 
>>>>>> <j...@nanthrax.net> wrote:
>>>>>>>
>>>>>>> Hi Ryan
>>>>>>>
>>>>>>> It sounds good.
>>>>>>>
>>>>>>> About multi-arg transforms, with the clarification we did a couple of
>>>>>>> weeks ago, I think we are good.
>>>>>>> Maybe a release with the core updated before announcing spec v3
>>>>>>> officially would be a good idea?
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Wed, Apr 30, 2025 at 12:35 AM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I think we’ve reached the point where it’s time to finalize and adopt 
>>>>>>>> the changes for Iceberg v3. We’ve been working toward this for the 
>>>>>>>> last few months and have now implemented the v3 features in the Java 
>>>>>>>> library to reduce the risk of needing changes or hitting problems (row 
>>>>>>>> lineage support in Spark 3.5 just went in!). We’ve also incorporated 
>>>>>>>> some clarifications and minor changes back into the spec from what 
>>>>>>>> we’ve learned.
>>>>>>>>
>>>>>>>> At this point, I’m confident that the spec is reasonable and correct. 
>>>>>>>> Thank you to everyone working on these reference implementations!
>>>>>>>>
>>>>>>>> The next step is to discuss any outstanding items or concerns about 
>>>>>>>> moving forward, and then to have a vote thread to adopt the spec. I’ll 
>>>>>>>> start off with a couple of items:
>>>>>>>>
>>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet 
>>>>>>>> been finalized by the Parquet community, but we’ve built a full, 
>>>>>>>> independent implementation in Iceberg to validate the spec. I think 
>>>>>>>> the Parquet community is primarily waiting on getting the PRs in to 
>>>>>>>> have a Java reference implementation, so the risk of changes to the 
>>>>>>>> Variant spec is small.
>>>>>>>>
>>>>>>>> There’s also an on-going vote to add encryption keys in support of 
>>>>>>>> full table encryption that I think we want to get in.
>>>>>>>>
>>>>>>>> Any other items we may want to clear up?
>>>>>>>>
>>>>>>>> Ryan
