Re: [DISCUSS] Finalizing the v3 spec

Anton Okolnychyi Wed, 07 May 2025 20:31:31 -0700

Steven, that may be a good point to add to ensure the metadata is properly
maintained. If I remember correctly, the Spark implementation already drops
old DVs in DELETE/UPDATE/MERGE but the data compaction wasn't doing it
originally. I wonder if we fixed it. Eduard may know more.


- Anton
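
For illustration, the row-count argument in Steven's message quoted below can be sketched roughly as follows. This is a minimal sketch, not the Iceberg API: `DataFile`, `DeletionVector`, and both function names are hypothetical stand-ins for manifest entries, and the file paths are made up. It only shows why the plain add/subtract calculation is accurate when no orphaned DVs exist, and why an extra membership check is needed otherwise.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data manifest entry."""
    path: str
    record_count: int  # rows written to the data file


@dataclass(frozen=True)
class DeletionVector:
    """Hypothetical stand-in for a delete manifest entry describing a DV."""
    referenced_data_file: str  # path of the data file the DV applies to
    deleted_rows: int          # number of positions marked as deleted


def row_count_assuming_no_orphans(
    data_files: Sequence[DataFile], dvs: Sequence[DeletionVector]
) -> int:
    # Pure add/subtract: accurate only if every DV references a live data file.
    return sum(f.record_count for f in data_files) - sum(d.deleted_rows for d in dvs)


def row_count_with_validation(
    data_files: Sequence[DataFile], dvs: Sequence[DeletionVector]
) -> int:
    # Extra step: ignore DVs whose referenced data file is no longer in the table.
    live_paths = {f.path for f in data_files}
    deleted = sum(d.deleted_rows for d in dvs if d.referenced_data_file in live_paths)
    return sum(f.record_count for f in data_files) - deleted


data_files = [DataFile("a.parquet", 100), DataFile("b.parquet", 50)]
clean_dvs = [DeletionVector("a.parquet", 10)]
# "removed.parquet" is not in data_files, so its DV is orphaned.
orphaned_dvs = clean_dvs + [DeletionVector("removed.parquet", 5)]
```

With `clean_dvs` both functions agree; with `orphaned_dvs` the add/subtract version undercounts, which is the case the proposed constraint would rule out.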

On Wed, May 7, 2025 at 16:29 Steven Wu <[email protected]> wrote:

> For the deletion vector change, should we add the following
> constraint/requirement for the write path in the spec? I don't know if this
> is already the behavior of the Spark implementation.
>
> "if a data file is removed from the table, the corresponding DV reference
> must also be removed from delete manifest file"
>
> This constraint guarantees that there are no orphaned DVs in the table
> state, which makes it cheaper to calculate an *accurate* table row count:
> just iterate through the manifest files (data and delete), adding data file
> row counts and subtracting DV row counts. There is no need to check whether
> the data files referenced by DVs are still part of the table, which can be
> a little more expensive.
>
>
>
> On Tue, May 6, 2025 at 9:18 AM Manu Zhang <[email protected]> wrote:
>
>> Thanks for the clarification, Ryan.
>>
>> I'm aware of the major changes, but I find it hard to go through all the
>> related descriptions which are scattered all over the place.
>>
>> Manu
>>
>> On Tue, May 6, 2025 at 11:24 PM Ryan Blue <[email protected]> wrote:
>>
>>> Manu,
>>>
>>> We aren't currently voting. We are discussing any outstanding items to
>>> address before we close v3 to further changes and adopt the existing v3
>>> changes. Right now, the open item is to clarify NaN behavior in geometry
>>> and geography, PR #12956 <https://github.com/apache/iceberg/pull/12956>.
>>>
>>> Thanks for noting that the row lineage changes should be added to the
>>> appendix, I'll open a PR to add it. That appendix is an area to highlight
>>> things that have changed across versions, but an omission does not alter
>>> the requirements elsewhere in the spec. The changes we are discussing are the
>>> things that are noted as part of v3 in the spec. The major additions are
>>> new types, DVs, and row lineage.
>>>
>>> Ryan
>>>
>>> On Tue, May 6, 2025 at 3:32 AM Manu Zhang <[email protected]>
>>> wrote:
>>>
>>>> I'm wondering what changes we are voting for here. Is it everything
>>>> related to
>>>> https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities
>>>> from the table spec?
>>>> How about changes to other specs?
>>>>
>>>> Do we summarize all the changes in
>>>> https://iceberg.apache.org/spec/#appendix-e-format-version-changes? It
>>>> looks like row lineage is missing here.
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>> On Tue, May 6, 2025 at 12:09 PM Anton Okolnychyi <[email protected]>
>>>> wrote:
>>>>
>>>>> DVs in Spark seem to behave reasonably, serving as a reference
>>>>> implementation of the V3 spec. There are areas for optimization/refinement
>>>>> but nothing was observed that requires changing the spec. I would also
>>>>> like to add the notion of content overhead/metadata (for Puffin/Parquet
>>>>> footers) to manifests to optimize DV maintenance. That said, it is
>>>>> optional information and can be added after finalizing V3.
>>>>>
>>>>> - Anton
>>>>>
>>>>> On Fri, May 2, 2025 at 23:23 Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>
>>>>>> Hi Ryan
>>>>>>
>>>>>> All good for the spec. The idea of a release is just to help "double
>>>>>> check" that the spec is good (we already saw some slight changes to the
>>>>>> spec while working on the release). I think we can be "confident" that we
>>>>>> won't have unexpected changes.
>>>>>>
>>>>>> Thanks !
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <[email protected]> wrote:
>>>>>> >
>>>>>> > Thanks, everyone! Looks like there are a few points to discuss.
>>>>>> >
>>>>>> > [JB] Maybe a release with the core updated before announcing spec
>>>>>> > v3 officially would be a good idea ?
>>>>>> > [Manu] Agree with Russell and JB that we make a “RC” release for V3
>>>>>> > spec to test implementations, compatibility, etc before finalizing it.
>>>>>> >
>>>>>> > As Fokko noted, we are currently concerned about the spec and not
>>>>>> > implementations. The purpose of implementation work before the spec
>>>>>> > is finalized is to reduce risk and build confidence that the spec is
>>>>>> > complete and correct. Once that’s done, it is important to finalize the
>>>>>> > changes. If we don’t finalize the changes, then implementations don’t
>>>>>> > know how/what to build and cannot plan when they will fully support v3,
>>>>>> > because it could change. Most of the work in other implementations will
>>>>>> > take place after the spec is adopted.
>>>>>> >
>>>>>> > Our process for building confidence in new spec versions is to
>>>>>> > update the spec with pending changes, implement them to validate (and
>>>>>> > clarify or adjust as needed), and vote to adopt the new version as a
>>>>>> > confirmation that we agree that the spec changes are reasonable and
>>>>>> > correct.
>>>>>> >
>>>>>> > We’ve already voted to accept the pending v3 changes into the spec,
>>>>>> > so the changes have already been in a candidate state for quite some time
>>>>>> > to work on implementations. Now we’re at the point where we’ve
>>>>>> > implemented the features and, in my opinion, have demonstrated the spec
>>>>>> > changes are correct and complete.
>>>>>> >
>>>>>> > To that end, the question I’m raising in this thread is “what areas
>>>>>> > and features need further validation?”
>>>>>> >
>>>>>> > I appreciate the ideas here (releasing will assist other
>>>>>> > implementations), but I don’t think that changes the question for this
>>>>>> > thread. The aim is to identify specific risks and blockers that we need
>>>>>> > to tackle before adopting the changes.
>>>>>> >
>>>>>> > [Russell] We should probably come to a resolution on the compressed
>>>>>> > metadata.json name as well, although that’s mostly retroactive. V3 would
>>>>>> > be the place where we could officially change the naming convention.
>>>>>> >
>>>>>> > I don’t think that this affects v3, but we should agree before
>>>>>> > moving on. The only part of the spec that would depend on this is the
>>>>>> > paths used by file system tables and that strategy is deprecated. We
>>>>>> > should only document for clarity (we can’t change it) and I think we can
>>>>>> > do that any time.
>>>>>> >
>>>>>> > For the conventions used in catalog tables, I don’t think that we
>>>>>> > want to have requirements in the spec for file naming. We’ve avoided that
>>>>>> > in the past and it isn’t needed. It’s nice to have a convention in
>>>>>> > implementation notes, but there are other ways to handle this like magic
>>>>>> > bytes and catalog tracking.
>>>>>> >
>>>>>> > [Gang] it is implicit and obvious that only bucket transform can
>>>>>> > apply to multi-arg transform, it is still unclear the order of source
>>>>>> > columns and algorithm to use to calculate the bucket value
>>>>>> >
>>>>>> > I think there is some confusion here, but Fokko may have already
>>>>>> > cleared it up.
>>>>>> >
>>>>>> > Right now, there are no multi-argument transforms in the spec. We
>>>>>> > have discussed adding a multi-argument bucket function, but there is not
>>>>>> > currently one in the spec. In order to minimize changes required for v3,
>>>>>> > we opted to update the spec to allow adding new transforms in a
>>>>>> > forward-compatible way between major spec versions (implementations must
>>>>>> > ignore unknown transforms).
>>>>>> >
>>>>>> > [Jia] We’re currently addressing the handling of null/NaN values
>>>>>> > for X, Y, Z, and M coordinates in the Parquet format repository
>>>>>> >
>>>>>> > I agree that this is a good thing to clarify. We currently state
>>>>>> > that the ranges are [-180, 180] and [-90, 90] for geography, but we
>>>>>> > should state how points with NaN values are handled.
>>>>>> >
>>>>>> >
>>>>>> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <[email protected]> wrote:
>>>>>> >>
>>>>>> >> Hi Jia
>>>>>> >>
>>>>>> >> I feel it would be nice to get that Parquet spec clarification
>>>>>> >> https://github.com/apache/parquet-format/pull/494 into Iceberg V3
>>>>>> >> spec as well, once we finalize that.
>>>>>> >>
>>>>>> >> Thanks
>>>>>> >> Szehon
>>>>>> >>
>>>>>> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <[email protected]> wrote:
>>>>>> >>>
>>>>>> >>> Hi Szehon,
>>>>>> >>>
>>>>>> >>> Thanks for clarifying it.
>>>>>> >>>
>>>>>> >>> We’re currently addressing the handling of null/NaN values for X,
>>>>>> >>> Y, Z, and M coordinates in the Parquet format repository. We’ve already
>>>>>> >>> concluded that the spec of Parquet (same on the Iceberg side I believe)
>>>>>> >>> only needs additional clarification to guide expected behavior:
>>>>>> >>> https://github.com/apache/parquet-format/pull/494
>>>>>> >>>
>>>>>> >>> BTW the Parquet Geo C++ PR has been merged today:
>>>>>> >>> https://github.com/apache/arrow/pull/45459  I believe the Parquet
>>>>>> >>> Geo Java PR is also very close.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> Jia
>>>>>> >>>
>>>>>> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <[email protected]> wrote:
>>>>>> >>>>
>>>>>> >>>> Hey Ryan,
>>>>>> >>>>
>>>>>> >>>> Thanks for raising this, and I'm very excited to see V3 being
>>>>>> >>>> finalized!
>>>>>> >>>>
>>>>>> >>>>> The v3 spec for multi-arg transform only advises to use
>>>>>> >>>>> `source-ids` instead of `source-id`. Although it is implicit and obvious
>>>>>> >>>>> that only bucket transform can apply to multi-arg transform, it is still
>>>>>> >>>>> unclear the order of source columns and algorithm to use to calculate the
>>>>>> >>>>> bucket value.
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> V3 now uses `source-ids` when there are multiple arguments and
>>>>>> >>>> `source-id` when there is just one. PR can be found here. This makes the
>>>>>> >>>> serialization deterministic without knowing the format-version,
>>>>>> >>>> simplifying the readers/writers. After some discussion on the PR, we've
>>>>>> >>>> decided to leave out the multi-arg bucket transform so the V3 spec can be
>>>>>> >>>> finalized. So V3 only contains the scaffolding for multi-arg transforms.
>>>>>> >>>>
>>>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>>>>> >>>>> bounds and geospatial predicate to be merged:
>>>>>> >>>>> https://github.com/apache/iceberg/pull/12667
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> I think it is a good idea to distinguish between the spec and
>>>>>> >>>> the actual code. If we all feel comfortable with the spec, I think we
>>>>>> >>>> could finalize it. Being comfortable also means that we know that we have
>>>>>> >>>> a working implementation, but I don't think we have to wrap up all the
>>>>>> >>>> loose ends before voting on the spec.
>>>>>> >>>>
>>>>>> >>>> On the PyIceberg side, we're also working to catch up on the V3
>>>>>> >>>> capabilities. Having a Java release that exposes these capabilities
>>>>>> >>>> helps, so we can do round-trip validation.
>>>>>> >>>>
>>>>>> >>>> Kind regards,
>>>>>> >>>> Fokko
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> On Wed, Apr 30, 2025 at 07:26 Jia Yu <[email protected]> wrote:
>>>>>> >>>>>
>>>>>> >>>>> Hi folks,
>>>>>> >>>>>
>>>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>>>>> >>>>> bounds and geospatial predicate to be merged:
>>>>>> >>>>> https://github.com/apache/iceberg/pull/12667
>>>>>> >>>>>
>>>>>> >>>>> Should a release with core updates include this PR?
>>>>>> >>>>>
>>>>>> >>>>> Thanks,
>>>>>> >>>>> Jia
>>>>>> >>>>>
>>>>>> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <[email protected]> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> Agree with Russell and JB that we make a "RC" release for V3
>>>>>> >>>>>> spec to test implementations, compatibility, etc before finalizing it.
>>>>>> >>>>>>
>>>>>> >>>>>> Thanks,
>>>>>> >>>>>> Manu
>>>>>> >>>>>>
>>>>>> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>> >>>>>>>
>>>>>> >>>>>>> Hi Ryan
>>>>>> >>>>>>>
>>>>>> >>>>>>> It sounds good.
>>>>>> >>>>>>>
>>>>>> >>>>>>> About multi-args transforms, with the clarification we did a
>>>>>> >>>>>>> couple of weeks ago, I think we are good.
>>>>>> >>>>>>> Maybe a release with the core updated before announcing spec
>>>>>> >>>>>>> v3 officially would be a good idea ?
>>>>>> >>>>>>>
>>>>>> >>>>>>> Regards
>>>>>> >>>>>>> JB
>>>>>> >>>>>>>
>>>>>> >>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <[email protected]> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Hi everyone,
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> I think we’ve reached the point where it’s time to finalize
>>>>>> >>>>>>>> and adopt the changes for Iceberg v3. We’ve been working toward this for
>>>>>> >>>>>>>> the last few months and have now implemented the v3 features in the Java
>>>>>> >>>>>>>> library to reduce the risk of needing changes or hitting problems (row
>>>>>> >>>>>>>> lineage support in Spark 3.5 just went in!). We’ve also incorporated some
>>>>>> >>>>>>>> clarifications and minor changes back into the spec from what we’ve
>>>>>> >>>>>>>> learned.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> At this point, I’m confident that the spec is reasonable and
>>>>>> >>>>>>>> correct. Thank you to everyone working on these reference
>>>>>> >>>>>>>> implementations!
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> The next step is to discuss any outstanding items or
>>>>>> >>>>>>>> concerns about moving forward, and then to have a vote thread to adopt
>>>>>> >>>>>>>> the spec. I’ll start off with a couple of items:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> One potential concern is that the upstream Variant spec
>>>>>> >>>>>>>> hasn’t yet been finalized by the Parquet community, but we’ve built a
>>>>>> >>>>>>>> full, independent implementation in Iceberg to validate the spec. I think
>>>>>> >>>>>>>> the Parquet community is primarily waiting on getting the PRs in to have
>>>>>> >>>>>>>> a Java reference implementation, so the risk of changes to the Variant
>>>>>> >>>>>>>> spec is small.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> There’s also an on-going vote to add encryption keys in
>>>>>> >>>>>>>> support of full table encryption that I think we want to get in.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Any other items we may want to clear up?
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Ryan
>>>>>>
>>>>>
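
As an aside on the `source-id` / `source-ids` point discussed earlier in the thread, the serialization rule can be sketched as below. This is only an illustration under the assumption that the rule is as described here (one source column is written as `source-id`, several as `source-ids`); the helper names are hypothetical and not part of any Iceberg library, and `example_multi_arg` is a made-up transform name, since the thread notes that V3 does not define a multi-argument transform.

```python
from typing import Any, Dict, List


def partition_field_to_json(
    name: str, transform: str, source_ids: List[int], field_id: int
) -> Dict[str, Any]:
    """Hypothetical writer: pick the key from the number of source columns."""
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")
    field: Dict[str, Any] = {"name": name, "transform": transform, "field-id": field_id}
    if len(source_ids) == 1:
        field["source-id"] = source_ids[0]        # single-argument transform
    else:
        field["source-ids"] = list(source_ids)    # multi-argument transform
    return field


def partition_field_source_ids(field: Dict[str, Any]) -> List[int]:
    """Hypothetical reader: accept either key without checking format-version."""
    if "source-ids" in field:
        return list(field["source-ids"])
    return [field["source-id"]]


single = partition_field_to_json("id_bucket", "bucket[16]", [1], 1000)
# "example_multi_arg" is a placeholder name for illustration only.
multi = partition_field_to_json("combined", "example_multi_arg", [1, 2], 1001)
```

Because the choice of key depends only on the number of source columns, a reader can recover the source columns from the field itself, which is the determinism the thread refers to.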
