Re: [DISCUSS] adoption of format version 3

Ryan Blue Wed, 07 Aug 2024 14:30:37 -0700

> we can still discuss the remaining items in the Iceberg geometry proposal:
> expression, partition transform


For these two, I think that we can make them backward-compatible. If we
update the spec to allow adding transforms between major releases, then we
don’t need to have the transform done by v3. Similarly, the expressions are
entirely in the Iceberg library so adding them after v3 is backward
compatible. We don’t think we need v3 to depend on either one!
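
A minimal sketch of the backward-compatibility idea above, assuming a reader that parses transform strings of the form `name` or `name[arg, ...]`. It is not Iceberg's actual implementation. The names `KNOWN_TRANSFORMS`, `Transform`, `parse_transform`, and `can_prune` are hypothetical, as are the `xz[12]` and `zorder[a, b]` strings in the usage note. The point is only that a reader which does not recognize a transform can keep the partition field and stop using it to skip files, so a transform added after v3 does not break older readers.

```python
import re
from dataclasses import dataclass

# Hypothetical set of transform names this reader understands (illustration only).
KNOWN_TRANSFORMS = {"identity", "bucket", "truncate", "year", "month", "day", "hour", "void"}


@dataclass(frozen=True)
class Transform:
    name: str
    args: tuple
    known: bool


def parse_transform(text: str) -> Transform:
    """Parse 'name' or 'name[arg, ...]' without failing on unrecognized names."""
    match = re.fullmatch(r"(\w+)(?:\[([^\]]*)\])?", text.strip())
    if match is None:
        raise ValueError(f"malformed transform: {text!r}")
    name = match.group(1)
    raw_args = match.group(2)
    args = tuple(a.strip() for a in raw_args.split(",")) if raw_args else ()
    return Transform(name, args, name in KNOWN_TRANSFORMS)


def can_prune(transform: Transform) -> bool:
    """An unknown transform stays in the metadata but is never used to skip files."""
    return transform.known
```

With this sketch, `can_prune(parse_transform("xz[12]"))` is false while `can_prune(parse_transform("bucket[16]"))` is true: the older reader still reads the table correctly, it just scans more files.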

> if the parquet geometry is accepted, we can quickly update the Iceberg
> proposal and start a vote?

It seems reasonable to me. If the Parquet type is done, we should be able
to get the geometry type in.

Ryan

On Tue, Aug 6, 2024 at 6:30 PM Jia Yu <[email protected]> wrote:

> Hi Ryan, Szehon, and other folks in the thread,
>
> Thanks for summarizing this. I'd love to have the geometry type in V3
> spec but I'd like to also understand the expected timeline of V3. Is
> there a rough cutoff time?
>
> Currently, we are actively working on the Parquet Geometry type
> proposal. We have resolved almost all concerns in the spec and the
> implementation is in progress. I think this will be accepted in a
> month (of course, pending the vote in the Parquet community).
>
> On the other hand, we can still discuss the remaining items in the
> Iceberg geometry proposal: expression, partition transform. But I
> think we have addressed the concerns on those items? So if the parquet
> geometry is accepted, we can quickly update the Iceberg proposal and
> start a vote? Of course, I'd love to have green lights from a few more
> Iceberg PMC members so we will be more confident on the timeline of
> this proposal.
>
> Thanks,
> Jia
>
> On Tue, Aug 6, 2024 at 4:50 PM Szehon Ho <[email protected]> wrote:
> >
> > It makes sense to me, thanks for summarizing it, it's an exciting list
> of new features.
> >
> > For Geo, I will let the Wherobots engineers (Jia Yu and others)
> comment, but the geo type could take more time if we wait for the
> Parquet-Format change, followed by a Parquet implementation release.
> >
> > +1 about multi-value transform, I think it will be great and do-able to
> get those in, the spec allows their existence but it's just waiting on
> implementation/review.
> >
> > Thanks
> > Szehon
> >
> > On Tue, Aug 6, 2024 at 4:42 PM Ryan Blue <[email protected]>
> wrote:
> >>
> >> I’ve been going through the list I’ve accumulated for v3 changes and I
> think we do have a fairly clear set of things that people are working on.
> There are two main areas. The first is centered around types and extending
> existing metadata:
> >>
> >> Add new types: timestamp(ns), variant, blob, and null
> >> Add new type promotion: long to timestamp,
> boolean/int/long/date/time/timestamp/uuid to string, null to anything, most
> types to variant
> >> Add default value support via initial-default, write-default
> >> Add multi-arg transforms (multi-column bucket, zorder)
> >>
> >> Then there are a few bigger items that have people actively working:
> >>
> >> Row-level tracking metadata
> >> Improvements for position delete performance
> >> Encryption metadata
> >> Geo support: geometry type, xz transform, and geo predicates
> >>
> >> I propose that we target the first set of things since that’s a group
> of similar changes. It makes sense (at least to me) to add new types in a
> group, and it also makes sense to extend type capabilities (defaults and
> promotions) at the same time. (Also, we can choose to exclude blob if it is
> a large amount of work)
> >>
> >> I’d include multi-arg transforms in that group since the design is well
> written and nearly done. And we can make sure that there is a
> backward-compatible way to add new transforms between major releases. The
> Java library can currently handle new transforms and if we get those
> details into the spec then we don’t need to get the specifics of multi-arg
> bucketing as part of the v3 release.
> >>
> >> For the second group of projects, I suggest that we continue to
> actively work on them and try to get at least 2 of them in. Encryption
> metadata is quite close and just needs a few table-level additions to the
> metadata file. The changes for row-level tracking and position delete
> performance should be reasonably sized.
> >>
> >> I’d also love to see the geo support in v3, but that’s also a
> well-scoped feature that could be a v4 if it isn’t going to make it in
> time. My main concern here is the size of the changes where I don’t have
> much context.
> >>
> >> In summary, I’d say we should aim to include the new types, promotion,
> default values, and multi-arg transforms. Then include any of the larger
> items that are ready in time. Does that sound reasonable?
> >>
> >> Ryan
> >>
> >>
> >> On Mon, Aug 5, 2024 at 3:20 PM Micah Kornfield <[email protected]>
> wrote:
> >>>>
> >>>> I suggest keeping those things separate — Micah, would you mind
> starting a separate thread so this one can focus on v3?
> >>>
> >>>
> >>> Yes I'll start another thread on this post V3, to allow for focus on
> closing off V3 with the current process (and see if there is interest in
> trying something new for v4).
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>> On Mon, Aug 5, 2024 at 12:17 PM Ryan Blue <[email protected]>
> wrote:
> >>>>
> >>>> At least for discussion purposes, I think the REST spec (and any spec
> that involves code that will ultimately be consumed) is probably a harder
> conversation.
> >>>>
> >>>> I agree that it’s a very different conversation and probably out of
> scope for the table v3 spec.
> >>>>
> >>>> I’m undecided if minor releases are necessary for non-code specs,
> this seems like it might be too much overhead and might not provide a ton
> of value (maybe you could elaborate on the value you see in it?).
> >>>>
> >>>> Thanks for bringing up the point about minor versions. It’s critical
> to keep in mind that we’re talking about two different types of changes.
> For the v3 discussion, I think the question is what changes we want to add
> in v3, which is an opportunity to group together forward-incompatible
> changes that require new behavior to read tables correctly.
> >>>>
> >>>> It’s great to discuss whether we want to change how we version the
> spec and see if we want to release breaking changes more often. I think
> that was Micah’s original intent for bringing up a regular release cadence
> for the spec. But we should also be aware that this is a separate
> discussion. Most of the points that Micah raised are covered by our
> existing process for new major versions:
> >>>>
> >>>> Add changes to the spec such that they are clearly attached to a
> future version
> >>>> Implement the changes in at least one implementation, probably the
> reference implementation
> >>>> When we have accumulated enough breaking changes, vote to adopt the
> new version
> >>>>
> >>>> There are differences that we may choose to change, like adding the
> changes to the spec rather than keeping them in PRs. And we may want to
> introduce a regular cadence to make the last step more predictable. Those
> are great discussions to have, but right now we know that we have changes
> we want to get into a v3 in the next few months. I suggest keeping those
> things separate — Micah, would you mind starting a separate thread so this
> one can focus on v3?
> >>>>
> >>>> I also see that if we were to go with Micah’s suggestion, it has an
> impact on the decisions that we need to make for the v3 release. But I
> think that even if we were to have a regular release cadence, it would
> still make sense to group features like new types together because it makes
> the versions easier to understand and limits the overall impact in the
> implementations.
> >>>>
> >>>> Ryan
> >>>>
> >>>>
> >>>> On Fri, Aug 2, 2024 at 11:39 AM Micah Kornfield <
> [email protected]> wrote:
> >>>>>>
> >>>>>> I have been a big advocate for releasing all the Iceberg specs
> regularly, and just follow a normal product release cycle with major and
> minor releases. I touched a bit of the reasoning in the thread for fixing
> stats fields in REST spec [1]. This helps a lot with engines that do not
> use any Iceberg open source library and just look at a spec and implement
> it. With a regular release, they can have a stable version to look into,
> rather than a spec that is changing all the time within the same version.
> >>>>>
> >>>>>
> >>>>> At least for discussion purposes, I think the REST spec (and any
> spec that involves code that will ultimately be consumed) is probably a
> harder conversation.  I'm undecided if minor releases are necessary for
> non-code specs, this seems like it might be too much overhead and might not
> provide a ton of value (maybe you could elaborate on the value you see in
> it?).
> >>>>>
> >>>>>>
> >>>>>> I think Fokko brought up a point that "this will introduce a
> process that will slow the evolution down", which is true because you need
> to spend additional effort and release it. And without a reference
> implementation, it is hard to say if the spec is mature enough to be
> released, which again makes it potentially tied to the release cycle of at
> least the Java library.
> >>>>>
> >>>>>
> >>>>> Sorry I think I missed Fokko's argument on the linked thread.  In my
> mind, the order of operations on non-code spec changes would be:
> >>>>>
> >>>>> 1.  Spec change is proposed/reviewed and agreed upon but not merged.
> >>>>> 2.  Reference implementation happens (possibly with revisions if
> implementation challenges arise).
> >>>>> 3.  Reference implementation is merged
> >>>>> 4.  Spec change is merged.
> >>>>> 5.  Spec is officially  "released" at some normal cadence (or in
> theory it could be done immediately).
> >>>>>
> >>>>> Steps 3 and 4 could happen simultaneously, or 4 could potentially
> have some lag to it to allow for further feedback (i.e. letting reference
> implementation be released) and revision.
> >>>>>
> >>>>> If step 5 is done immediately after step 4, I don't think this would
> slow down evolution (but comes at the cost of more versions).  Part of step
> five would necessitate changing code for any incomplete implementations to
> only be turned on in the next revision (or larger features could be worked
> on in a separate branch to avoid this complication).
> >>>>>
> >>>>> Thanks,
> >>>>> Micah
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Aug 2, 2024 at 9:10 AM Jack Ye <[email protected]> wrote:
> >>>>>>
> >>>>>> > An alternative view: Would it make sense to start releasing the
> table specification on a regular cadence (e.g. quarterly, every 6 months or
> yearly)?
> >>>>>>
> >>>>>> I have been a big advocate for releasing all the Iceberg specs
> regularly, and just follow a normal product release cycle with major and
> minor releases. I touched a bit of the reasoning in the thread for fixing
> stats fields in REST spec [1]. This helps a lot with engines that do not
> use any Iceberg open source library and just look at a spec and implement
> it. With a regular release, they can have a stable version to look into,
> rather than a spec that is changing all the time within the same version.
> >>>>>>
> >>>>>> It is important to note that minor spec versions will not be
> leveraged in implementations the way we currently have logic for
> switching behavior depending on major versions. It is purely for the
> purpose of making more incremental progress on the spec, and providing
> stable spec versions for other reference implementations. Otherwise, the
> branches in the codebase to handle different versions easily get out of
> control.
> >>>>>>
> >>>>>> I think Fokko brought up a point that "this will introduce a
> process that will slow the evolution down", which is true because you need
> to spend additional effort and release it. And without a reference
> implementation, it is hard to say if the spec is mature enough to be
> released, which again makes it potentially tied to the release cycle of at
> least the Java library.
> >>>>>>
> >>>>>> Curious what people think.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jack Ye
> >>>>>>
> >>>>>> [1]
> https://lists.apache.org/thread/v6x772v9sgo0xhpwmh4br756zhbgomtf
> >>>>>>
> >>>>>> On Wed, Jul 31, 2024 at 10:19 PM Micah Kornfield <
> [email protected]> wrote:
> >>>>>>>
> >>>>>>> It sounds like most of the opinions so far are waiting for the
> scope of work to finish before finalizing the specification.
> >>>>>>>
> >>>>>>> An alternative view: Would it make sense to start releasing the
> table specification on a regular cadence (e.g. quarterly, every 6 months or
> yearly)?  I think the problem with waiting for features to get in is that
> priorities change and things take longer than expected, thus leaving the
> actual finalization of the specification in limbo and probably adding to
> project management overhead.   If the specification is released regularly
> then it means features can always be included in the next release without
> too much delay hopefully.  The main downside I can think of in this
> approach is having to have more branches in code to handle different
> versions.
> >>>>>>>
> >>>>>>> One corollary to this approach is spec changes shouldn't be merged
> before their implementations are ready.
> >>>>>>>
> >>>>>>>>   - At least one complete reference implementation should exist.
> >>>>>>>
> >>>>>>>
> >>>>>>> For more complicated features I think at some point soon it might
> be worth considering two implementations (or at least 1 full implementation
> and 1 read only implementation) to make sure there aren't compatibility
> issues/misunderstandings in the specification (e.g. I think Variant and
> Geography fall into this category).
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Micah
> >>>>>>>
> >>>>>>> On Wed, Jul 31, 2024 at 12:47 PM Russell Spitzer <
> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>> I think this all sounds good, the real question is whether or not
> we have someone to actively work on the proposals. I think for things like
> Default Values and Geo Types we have folks actively working on them so it's
> not a big deal.
> >>>>>>>>
> >>>>>>>> On Wed, Jul 31, 2024 at 2:09 PM Szehon Ho <
> [email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> Sorry I missed the sync this morning (sick), I'd like to push
> for geo too.
> >>>>>>>>>
> >>>>>>>>> I think on this front as per the last sync, Ryan recommended to
> wait for Parquet support to land, to avoid having two versions on Iceberg
> side (Iceberg-native vs Parquet-native).  Parquet support is being actively
> worked on iiuc: https://github.com/apache/parquet-format/pull/240 .  But
> it would bind V3 to the parquet-format release timeline, unless we start
> with iceberg-native support first and move later (as we originally
> proposed).
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Szehon
> >>>>>>>>>
> >>>>>>>>> On Wed, Jul 31, 2024 at 10:58 AM Walaa Eldin Moustafa <
> [email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Another feature that was planned for V3 is support for default
> values.
> >>>>>>>>>> Spec doc update was already merged a while ago [1].
> Implementation is
> >>>>>>>>>> ongoing in this PR [2].
> >>>>>>>>>>
> >>>>>>>>>> [1] https://iceberg.apache.org/spec/#default-values
> >>>>>>>>>> [2] https://github.com/apache/iceberg/pull/9502
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Walaa.
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Jul 31, 2024 at 10:52 AM Russell Spitzer
> >>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>> >
> >>>>>>>>>> > Thanks for bringing this up, I would say that from my
> perspective I have time to really push through hopefully two things
> >>>>>>>>>> >
> >>>>>>>>>> > Variant Type and
> >>>>>>>>>> > Row Lineage (which I will have a proposal for on the mailing
> list next week)
> >>>>>>>>>> >
> >>>>>>>>>> > I'm using the Project to try to track logistics and minutiae
> required for the new spec version but I would like to bring other work in
> there as well so we can get a clear picture of what is actually being
> actively worked on.
> >>>>>>>>>> >
> >>>>>>>>>> > On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble <
> [email protected]> wrote:
> >>>>>>>>>> >>
> >>>>>>>>>> >> Good morning,
> >>>>>>>>>> >>
> >>>>>>>>>> >> To continue the community sync today when format version 3
> was discussed.
> >>>>>>>>>> >>
> >>>>>>>>>> >> Questions answered by consensus:
> >>>>>>>>>> >> - Format version releases should _not_ be tied to Iceberg
> version releases.
> >>>>>>>>>> >> - Several planned features will require format version
> releases; the process shouldn't be onerous.
> >>>>>>>>>> >>
> >>>>>>>>>> >> Unanswered questions:
> >>>>>>>>>> >> - What will be included in format version 3?
> >>>>>>>>>> >>   - What is a reasonable target date?
> >>>>>>>>>> >>   - How to track progress? Today, there are two public lists:
> >>>>>>>>>> >>     - GH milestone:
> https://github.com/apache/iceberg/milestone/42
> >>>>>>>>>> >>     - GH project:
> https://github.com/orgs/apache/projects/377
> >>>>>>>>>> >> - What is required of a feature in order to be included in
> any adopted format version?
> >>>>>>>>>> >>   - At least one complete reference implementation should
> exist.
> >>>>>>>>>> >>     - Java is the reference implementation by convention;
> that's OK, but not perfect. Should Java be the reference implementation by
> mandate?
> >>>>>>>>>> >>
> >>>>>>>>>> >> Have I missed anything?
> >>>>>>>>>> >>
> >>>>>>>>>> >> --
> >>>>>>>>>> >> Jacob Marble
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Databricks
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Databricks
>


-- 
Ryan Blue
Databricks
