Hi Jia

I feel it would be nice to get that Parquet spec clarification
https://github.com/apache/parquet-format/pull/494 into the Iceberg V3 spec as
well, once we finalize that.

Thanks
Szehon

On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:

> Hi Szehon,
>
> Thanks for clarifying it.
>
> We’re currently addressing the handling of null/NaN values for X, Y, Z,
> and M coordinates in the Parquet format repository. We’ve already concluded
> that the spec of Parquet (same on the Iceberg side I believe) only needs
> additional clarification to guide expected behavior:
> https://github.com/apache/parquet-format/pull/494
>
> BTW the Parquet Geo C++ PR has been merged today:
> https://github.com/apache/arrow/pull/45459  I believe the Parquet Geo
> Java PR is also very close.
>
> Thanks,
> Jia
>
> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org>
> wrote:
>
>> Hey Ryan,
>>
>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>
>>> The v3 spec for multi-arg transform only advises to use `source-ids`
>>> instead of `source-id`. Although it is implicit and obvious that only
>>> bucket transform can apply to multi-arg transform, it is still unclear the
>>> order of source columns and algorithm to use to calculate the bucket value.
>>>
>>
>> V3 now uses `source-ids` when there are multiple arguments and `source-id`
>> when there is just one. The PR can be found here
>> <https://github.com/apache/iceberg/pull/12644>. This makes the
>> serialization deterministic without knowing the format-version, simplifying
>> the readers/writers. After some discussion on the PR, we've decided to
>> leave out the multi-arg bucket transform so the V3 spec can be finalized.
>> So V3 only contains the scaffolding for multi-arg transforms.
>>
>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and
>>> geospatial predicate to be merged:
>>> https://github.com/apache/iceberg/pull/12667
>>
>>
>> I think it is a good idea to distinguish between the spec and the actual
>> code. If we all feel comfortable with the spec, I think we could finalize
>> it. Being comfortable also means that we know that we have a working
>> implementation, but I don't think we have to wrap up all the loose ends
>> before voting on the spec.
>>
>> On the PyIceberg side, we're also working to catch up on the V3
>> capabilities <https://github.com/apache/iceberg-python/issues/1818>.
>> Having a Java release that exposes these capabilities helps, so we can do
>> round-trip validation.
>>
>> Kind regards,
>> Fokko
>>
>>
>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
>>
>>> Hi folks,
>>>
>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds
>>> and geospatial predicate to be merged:
>>> https://github.com/apache/iceberg/pull/12667
>>>
>>> Should a release with core updates include this PR?
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com>
>>> wrote:
>>>
>>>> Agree with Russell and JB that we should make an "RC" release of the V3
>>>> spec to test implementations, compatibility, etc. before finalizing it.
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Hi Ryan
>>>>>
>>>>> It sounds good.
>>>>>
>>>>> About multi-arg transforms, with the clarification we made a couple of
>>>>> weeks ago, I think we are good.
>>>>> Maybe a release with the core updated before announcing spec v3
>>>>> officially would be a good idea?
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I think we’ve reached the point where it’s time to finalize and adopt
>>>>>> the changes for Iceberg v3. We’ve been working toward this for the last
>>>>>> few months and have now implemented the v3 features in the Java library
>>>>>> to reduce the risk of needing changes or hitting problems (row lineage
>>>>>> support in Spark 3.5 just went in!). We’ve also incorporated some
>>>>>> clarifications and minor changes back into the spec from what we’ve
>>>>>> learned.
>>>>>>
>>>>>> At this point, I’m confident that the spec is reasonable and correct.
>>>>>> Thank you to everyone working on these reference implementations!
>>>>>>
>>>>>> The next step is to discuss any outstanding items or concerns about
>>>>>> moving forward, and then to have a vote thread to adopt the spec. I’ll
>>>>>> start off with a couple of items:
>>>>>>
>>>>>> One potential concern is that the upstream Variant spec hasn’t yet
>>>>>> been finalized by the Parquet community, but we’ve built a full,
>>>>>> independent implementation in Iceberg to validate the spec. I think the
>>>>>> Parquet community is primarily waiting on getting the PRs in to have a
>>>>>> Java reference implementation, so the risk of changes to the Variant
>>>>>> spec is small.
>>>>>>
>>>>>> There’s also an on-going vote to add encryption keys in support of
>>>>>> full table encryption that I think we want to get in.
>>>>>>
>>>>>> Any other items we may want to clear up?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>
