Re: [DISCUSS] Variant Spec Location

Micah Kornfield Thu, 15 Aug 2024 08:04:40 -0700

>
> I agree that it would be beneficial to make a sub-project; the main
> problem is political, not logistical. I've been asking for movement from
> other relevant projects for a month and we simply haven't gotten anywhere.



I just wanted to double check that these issues were brought directly to
the Spark community (i.e. a discussion thread on the Spark developer
mailing list) and not via backchannels.

I'm not sure the outcome would be different and I don't think this should
block forking the spec, but we should make sure that the decision is
publicly documented within both communities.

Thanks,
Micah

On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <[email protected]>
wrote:

> @Gang Wu
>
> I agree that it would be beneficial to make a sub-project; the main
> problem is political, not logistical. I've been asking for movement from
> other relevant projects for a month and we simply haven't gotten anywhere.
> I don't think there is anything that would stop us from moving to a joint
> project in the future and if you know of some way of encouraging that
> movement from other relevant parties I would be glad to collaborate in
> doing that. One thing that I don't want to do is have the Iceberg project
> stay in a holding pattern without any clear roadmap as to how to proceed.
>
> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <[email protected]> wrote:
>
>> I’m on board with copying the spec into our repository. However, as we’ve
>> talked about, it’s not just a straightforward copy—there are already some
>> divergences. Some of them are under discussion. Iceberg is definitely the
>> best place for these specs. Engines like Trino and Flink can then rely on
>> the Iceberg specs as a solid foundation.
>>
>> Yufei
>>
>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <[email protected]> wrote:
>>
>>> Sorry for chiming in late.
>>>
>>> From the discussion in
>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>>> don't quite understand why it is logistically complicated to create a
>>> sub-project to hold the variant spec and impl.
>>>
>>> IMHO, copying the variant type spec into Apache Iceberg has some
>>> deficiencies:
>>> - It is a burden to update two repos if there is a variant type spec
>>> change and will likely result in deviation if some changes do not reach
>>> agreement from both parties.
>>> - Implementers are required to keep an eye on both specs (considering
>>> proprietary engines where both Iceberg and Delta are supported).
>>> - Putting the spec and impl of the variant type in the Iceberg repo loses
>>> the opportunity for better native support from file formats like Parquet
>>> and ORC.
>>>
>>> I'm not sure if it is possible to create a separate project (e.g.
>>> apache/variant-type) to make it a single point of truth. We can learn from
>>> the experience of Apache Arrow. In this fashion, different engines, table
>>> formats and file formats can follow the same spec and are free to depend on
>>> the reference implementations from apache/variant-type or implement their
>>> own.
>>>
>>> Best,
>>> Gang
>>>
>>>
>>>
>>>
>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <[email protected]> wrote:
>>>
>>>> +1 for copying the spec into our repository. I think we need to own it
>>>> fully as a part of the table spec, and we can build compatibility through
>>>> tests.
>>>>
>>>> -Jack
>>>>
>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>>> [email protected]> wrote:
>>>>
>>>>> I'm not really in favor of linking and annotating, as that just makes
>>>>> things more complicated and is still essentially forking, just with more
>>>>> steps. If we just track our annotations / modifications to a single
>>>>> commit/version then we have the same issue again, but now you have to go
>>>>> to multiple sources to get the actual Spec. *In addition, our very copy
>>>>> of the Spec is going to require new types which don't exist in the Spark
>>>>> Spec, which necessarily means diverging.* We will need to take up new
>>>>> primitive IDs (as noted in my first email).
>>>>>
>>>>> The other issue I have is that I don't think the Spark Spec is really
>>>>> going through a thorough review process from all members of the Spark
>>>>> community. I believe it probably should have gone through the SPIP
>>>>> process, but instead it seems to have been merged without broad
>>>>> community involvement.
>>>>>
>>>>> The only way to truly avoid diverging is to only have a single copy of
>>>>> the spec; in our previous discussions the vast majority of the Apache
>>>>> Iceberg community wanted it to exist here.
>>>>>
>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I'm really excited about the introduction of variant type to Iceberg,
>>>>>> but I want to raise concerns about forking the spec.
>>>>>>
>>>>>> I feel like preemptively forking would create the situation where we
>>>>>> end up diverging because there's little reason to work with both
>>>>>> communities to evolve in a way that benefits everyone.
>>>>>>
>>>>>> I would much rather point to a specific version of the spec and
>>>>>> annotate any variance in Iceberg's handling. This would allow us to
>>>>>> continue without dividing the communities.
>>>>>>
>>>>>> If at any point there are irreconcilable differences, I would support
>>>>>> forking, but I don't feel like that should be the initial step.
>>>>>>
>>>>>> No one is excited about the possibility that the physical
>>>>>> representations end up diverging, but it feels like we're setting
>>>>>> ourselves up for that exact scenario.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 to what's already being said here. It is good to copy the spec to
>>>>>>> Iceberg and add context that's specific to Iceberg, but at the same
>>>>>>> time, we should maintain compatibility.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> +1 to copy the spec into our repository. I think the best way to
>>>>>>>> keep compatibility is building integration tests.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Manu
>>>>>>>>
>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>>>>>
>>>>>>>>> Given the differences between the supported types and the lack of
>>>>>>>>> interest from the other project, I think it is reasonable to
>>>>>>>>> duplicate the specification to our repository.
>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as
>>>>>>>>> much as possible, to keep compatibility as much as possible. Maybe
>>>>>>>>> even revert to a shared specification if the situation changes.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> Aihua Xu <[email protected]> wrote (on Tue, Aug 13, 2024,
>>>>>>>>> 19:52):
>>>>>>>>>
>>>>>>>>>> Thanks Russell for bringing this up.
>>>>>>>>>>
>>>>>>>>>> This is the main blocker to moving forward with Variant support
>>>>>>>>>> in Iceberg, and hopefully we can reach a consensus. I also feel it
>>>>>>>>>> makes more sense to move the spec into Iceberg rather than have the
>>>>>>>>>> Spark engine own it while we try to stay compatible with the Spark
>>>>>>>>>> spec.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Aihua
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Y’all,
>>>>>>>>>>>
>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal. While
>>>>>>>>>>> we were hoping to move the Variant and Shredding specifications
>>>>>>>>>>> from Spark into Iceberg, there doesn’t seem to be a lot of interest
>>>>>>>>>>> in that. Unfortunately, I think we have a number of issues with
>>>>>>>>>>> just linking to the Spark project directly from within Iceberg and
>>>>>>>>>>> *I believe we need to copy the specifications into our repository*.
>>>>>>>>>>>
>>>>>>>>>>> There are a few reasons why I think this is necessary.
>>>>>>>>>>>
>>>>>>>>>>> First, we have a divergence of types already. The Spark
>>>>>>>>>>> Specification already includes types which Iceberg has no
>>>>>>>>>>> definition for (19, 20
>>>>>>>>>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>>>>>>>>>>> - Interval Types) and Iceberg already has a type which is not
>>>>>>>>>>> included within the Spark Specification (Time) and will soon have
>>>>>>>>>>> more with TimestampNS and Geo.
>>>>>>>>>>>
>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard
>>>>>>>>>>> dependency for other engines. We are working with several
>>>>>>>>>>> implementers of the Iceberg spec and it has previously been agreed
>>>>>>>>>>> that it would be best if the source of truth for Variant existed in
>>>>>>>>>>> an engine- and file-format-neutral location. The Iceberg project
>>>>>>>>>>> has a good open model of governance and, as we have seen so far
>>>>>>>>>>> discussing Variant
>>>>>>>>>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>,
>>>>>>>>>>> open and active collaboration. This would also help as we can
>>>>>>>>>>> strictly version our changes in line with the rest of the Iceberg
>>>>>>>>>>> spec.
>>>>>>>>>>>
>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires
>>>>>>>>>>> some group analysis and discussion before we commit it. Again, I
>>>>>>>>>>> think the Iceberg community is probably the right place for this to
>>>>>>>>>>> happen, as we have already started discussions here on these topics.
>>>>>>>>>>>
>>>>>>>>>>> For these reasons I think we should go with a direct copy of the
>>>>>>>>>>> existing specification from the Spark Project and move ahead with
>>>>>>>>>>> our discussions and modifications within Iceberg. That said, *I do
>>>>>>>>>>> not want to diverge from the Spark proposal if possible*. For
>>>>>>>>>>> example, although we do not use the Interval types above, I think
>>>>>>>>>>> we should not reuse those type IDs within our spec. Iceberg's
>>>>>>>>>>> Variant Spec types 19 and 20 would remain unused, along with any
>>>>>>>>>>> other types we think are not applicable. We should strive whenever
>>>>>>>>>>> possible to allow for compatibility.
>>>>>>>>>>>
>>>>>>>>>>> In the interest of moving forward with this proposal I am hoping
>>>>>>>>>>> to see if anyone in the community objects to this plan going
>>>>>>>>>>> forward or has a better alternative.
>>>>>>>>>>>
>>>>>>>>>>> As always I am thankful for your time and am eager to hear back
>>>>>>>>>>> from everyone,
>>>>>>>>>>> Russ
>>>>>>>>>>>
>>>>>>>>>>>
