> At least for discussion purposes, I think the REST spec (and any spec that
> involves code that will ultimately be consumed) is probably a harder
> conversation.

I agree that it’s a very different conversation and probably out of scope
for the table v3 spec.

> I’m undecided if minor releases are necessary for non-code specs, this
> seems like it might be too much overhead and might not provide a ton of
> value (maybe you could elaborate on the value you see in it?).

Thanks for bringing up the point about minor versions. It’s critical to
keep in mind that we’re talking about two different types of changes. For
the v3 discussion, I think the question is what changes we want to add in
v3, which is an opportunity to group together forward-incompatible changes
that require new behavior to read tables correctly.

It’s great to discuss whether we want to change how we version the spec and
see if we want to release breaking changes more often. I *think* that was
Micah’s original intent for bringing up a regular release cadence for the
spec. But we should also be aware that this is a separate discussion. Most
of the points that Micah raised are covered by our existing process for new
*major* versions:

   1. Add changes to the spec such that they are clearly attached to a
   future version
   2. Implement the changes in at least one implementation, probably the
   reference implementation
   3. When we have accumulated enough breaking changes, vote to adopt the
   new version

There are differences that we may choose to change, like adding the changes
to the spec rather than keeping them in PRs. And we may want to introduce a
regular cadence to make the last step more predictable. Those are great
discussions to have, but right now we know that we have changes we want to
get into a v3 in the next few months. I suggest keeping those things
separate. Micah, would you mind starting a separate thread so this one can
focus on v3?

I also see that if we were to go with Micah’s suggestion, it has an impact
on the decisions that we need to make for the v3 release. But I think that
even if we were to have a regular release cadence, it would still make
sense to group features like new types together because it makes the
versions easier to understand and limits the overall impact in the
implementations.

Ryan

On Fri, Aug 2, 2024 at 11:39 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

>> I have been a big advocate for releasing all the Iceberg specs regularly,
>> and just follow a normal product release cycle with major and minor
>> releases. I touched a bit of the reasoning in the thread for fixing stats
>> fields in REST spec [1]. This helps a lot with engines that do not use any
>> Iceberg open source library and just look at a spec and implement it. With
>> a regular release, they can have a stable version to look into, rather than
>> a spec that is changing all the time within the same version.
>
>
> At least for discussion purposes, I think the REST spec (and any spec that
> involves code that will ultimately be consumed) is probably a harder
> conversation.  I'm undecided if minor releases are necessary for non-code
> specs, this seems like it might be too much overhead and might not provide
> a ton of value (maybe you could elaborate on the value you see in it?).
>
>
>> I think Fokko brought up a point that "this will introduce a process that
>> will slow the evolution down", which is true because you need to spend
>> additional effort and release it. And without a reference implementation,
>> it is hard to say if the spec is mature enough to be released, which again
>> makes it potentially tied to the release cycle of at least the Java library.
>
>
> Sorry I think I missed Fokko's argument on the linked thread.  In my mind,
> the order of operations on non-code spec changes would be:
>
> 1.  Spec change is proposed/reviewed and agreed upon but not merged.
> 2.  Reference implementation happens (possibly with revisions if
> implementation challenges arise).
> 3.  Reference implementation is merged
> 4.  Spec change is merged.
> 5.  Spec is officially  "released" at some normal cadence (or in theory it
> could be done immediately).
>
> Steps 3 and 4 could happen simultaneously, or 4 could potentially have
> some lag to it to allow for further feedback (i.e. letting reference
> implementation be released) and revision.
>
> If step 5 is done immediately after step 4, I don't think this would slow
> down evolution (but comes at the cost of more versions).  Part of step five
> would necessitate changing code for any incomplete implementations to only
> be turned on in the next revision (or larger features could be worked on in
> a separate branch to avoid this complication).
>
> Thanks,
> Micah
>
>
>
>
> On Fri, Aug 2, 2024 at 9:10 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> > An alternative view: Would it make sense to start releasing the table
>> specification on a regular cadence (e.g. quarterly, every 6 months or
>> yearly)?
>>
>> I have been a big advocate for releasing all the Iceberg specs regularly,
>> and just follow a normal product release cycle with major and minor
>> releases. I touched a bit of the reasoning in the thread for fixing stats
>> fields in REST spec [1]. This helps a lot with engines that do not use any
>> Iceberg open source library and just look at a spec and implement it. With
>> a regular release, they can have a stable version to look into, rather than
>> a spec that is changing all the time within the same version.
>>
>> It is important to note that minor spec versions will not be leveraged in
>> implementations like how we have logics right now for switching behaviors
>> depending on major versions. It is purely for the purpose of making more
>> incremental progress on the spec, and providing stable spec versions for
>> other reference implementations. Otherwise, the branches in the codebase to
>> handle different versions easily get out of control.
>>
>> I think Fokko brought up a point that "this will introduce a process that
>> will slow the evolution down", which is true because you need to spend
>> additional effort and release it. And without a reference implementation,
>> it is hard to say if the spec is mature enough to be released, which again
>> makes it potentially tied to the release cycle of at least the Java library.
>>
>> Curious what people think.
>>
>> Best,
>> Jack Ye
>>
>> [1] https://lists.apache.org/thread/v6x772v9sgo0xhpwmh4br756zhbgomtf
>>
>> On Wed, Jul 31, 2024 at 10:19 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> It sounds like most of the opinions so far are waiting for the scope of
>>> work to finish before finalizing the specification.
>>>
>>> An alternative view: Would it make sense to start releasing the table
>>> specification on a regular cadence (e.g. quarterly, every 6 months or
>>> yearly)?  I think the problem with waiting for features to get in is that
>>> priorities change and things take longer than expected, thus leaving the
>>> actual finalization of the specification in limbo and probably adds to
>>> project management overhead.   If the specification is released regularly
>>> then it means features can always be included in the next release without
>>> too much delay hopefully.  The main downside I can think of in this
>>> approach is having to have more branches in code to handle different
>>> versions.
>>>
>>> One corollary to this approach is spec changes shouldn't be merged
>>> before their implementations are ready.
>>>
>>>   - At least one complete reference implementation should exist.
>>>
>>>
>>> For more complicated features I think at some point soon it might be
>>> worth considering two implementations (or at least 1 full implementation
>>> and 1 read only implementation) to make sure there aren't compatibility
>>> issues/misunderstandings in the specification (e.g. I think Variant and
>>> Geography fall into this category).
>>>
>>> Cheers,
>>> Micah
>>>
>>> On Wed, Jul 31, 2024 at 12:47 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> I think this all sounds good, the real question is whether or not we
>>>> have someone to actively work on the proposals. I think for things like
>>>> Default Values and Geo Types we have folks actively working on them so it's
>>>> not a big deal.
>>>>
>>>> On Wed, Jul 31, 2024 at 2:09 PM Szehon Ho <szehon.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sorry I missed the sync this morning (sick), I'd like to push for geo
>>>>> too.
>>>>>
>>>>> I think on this front as per the last sync, Ryan recommended to wait
>>>>> for Parquet support to land, to avoid having two versions on Iceberg side
>>>>> (Iceberg-native vs Parquet-native).  Parquet support is being actively
>>>>> worked on iiuc: https://github.com/apache/parquet-format/pull/240 .
>>>>> But it would bind V3 to the parquet-format release timeline, unless we
>>>>> start with iceberg-native support first and move later (as we originally
>>>>> proposed).
>>>>>
>>>>> Thanks,
>>>>> Szehon
>>>>>
>>>>> On Wed, Jul 31, 2024 at 10:58 AM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> Another feature that was planned for V3 is support for default values.
>>>>>> Spec doc update was already merged a while ago [1]. Implementation is
>>>>>> ongoing in this PR [2].
>>>>>>
>>>>>> [1] https://iceberg.apache.org/spec/#default-values
>>>>>> [2] https://github.com/apache/iceberg/pull/9502
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>> On Wed, Jul 31, 2024 at 10:52 AM Russell Spitzer
>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>> >
>>>>>> > Thanks for bringing this up, I would say that from my perspective I
>>>>>> > have time to really push through hopefully two things
>>>>>> >
>>>>>> > Variant Type and
>>>>>> > Row Lineage (which I will have a proposal for on the mailing list
>>>>>> > next week)
>>>>>> >
>>>>>> > I'm using the Project to try to track logistics and minutia
>>>>>> > required for the new spec version but I would like to bring other work in
>>>>>> > there as well so we can get a clear picture of what is actually being
>>>>>> > actively worked on.
>>>>>> >
>>>>>> > On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble <
>>>>>> > jacobmar...@influxdata.com> wrote:
>>>>>> >>
>>>>>> >> Good morning,
>>>>>> >>
>>>>>> >> To continue the community sync today when format version 3 was
>>>>>> >> discussed.
>>>>>> >>
>>>>>> >> Questions answered by consensus:
>>>>>> >> - Format version releases should _not_ be tied to Iceberg version
>>>>>> >>   releases.
>>>>>> >> - Several planned features will require format version releases;
>>>>>> >>   the process shouldn't be onerous.
>>>>>> >>
>>>>>> >> Unanswered questions:
>>>>>> >> - What will be included in format version 3?
>>>>>> >>   - What is a reasonable target date?
>>>>>> >>   - How to track progress? Today, there are two public lists:
>>>>>> >>     - GH milestone: https://github.com/apache/iceberg/milestone/42
>>>>>> >>     - GH project: https://github.com/orgs/apache/projects/377
>>>>>> >> - What is required of a feature in order to be included in any
>>>>>> >>   adopted format version?
>>>>>> >>   - At least one complete reference implementation should exist.
>>>>>> >>     - Java is the reference implementation by convention; that's
>>>>>> >>       OK, but not perfect. Should Java be the reference implementation by
>>>>>> >>       mandate?
>>>>>> >>
>>>>>> >> Have I missed anything?
>>>>>> >>
>>>>>> >> --
>>>>>> >> Jacob Marble
>>>>>>
>>>>>

-- 
Ryan Blue
Databricks
