> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relevant projects for a month and we simply haven't gotten anywhere.
I just wanted to double check that these issues were brought directly to the Spark community (i.e. a discussion thread on the Spark developer mailing list) and not via backchannels. I'm not sure the outcome would be different and I don't think this should block forking the spec, but we should make sure that the decision is publicly documented within both communities.

Thanks,
Micah

On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> @Gang Wu
>
> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relevant projects for a month and we simply haven't gotten anywhere. I don't think there is anything that would stop us from moving to a joint project in the future and if you know of some way of encouraging that movement from other relevant parties I would be glad to collaborate in doing that. One thing that I don't want to do is have the Iceberg project stay in a holding pattern without any clear roadmap as to how to proceed.
>
> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> I’m on board with copying the spec into our repository. However, as we’ve talked about, it’s not just a straightforward copy; there are already some divergences. Some of them are under discussion. Iceberg is definitely the best place for these specs. Engines like Trino and Flink can then rely on the Iceberg specs as a solid foundation.
>>
>> Yufei
>>
>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>
>>> Sorry for chiming in late.
>>>
>>> From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl.
>>>
>>> IMHO, copying the variant type spec into Apache Iceberg has some deficiencies:
>>> - It is a burden to update two repos if there is a variant type spec change, and it will likely result in deviation if some changes do not reach agreement from both parties.
>>> - Implementers are required to keep an eye on both specs (considering proprietary engines where both Iceberg and Delta are supported).
>>> - Putting the spec and impl of variant type in the Iceberg repo loses the opportunity for better native support from file formats like Parquet and ORC.
>>>
>>> I'm not sure if it is possible to create a separate project (e.g. apache/variant-type) to make it a single point of truth. We can learn from the experience of Apache Arrow. In this fashion, different engines, table formats and file formats can follow the same spec and are free to depend on the reference implementations from apache/variant-type or implement their own.
>>>
>>> Best,
>>> Gang
>>>
>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> +1 for copying the spec into our repository, I think we need to own it fully as a part of the table spec, and we can build compatibility through tests.
>>>>
>>>> -Jack
>>>>
>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> I'm not really in favor of linking and annotating as that just makes things more complicated and still is essentially forking just with more steps. If we just track our annotations / modifications to a single commit/version then we have the same issue again but now you have to go to multiple sources to get the actual Spec. *In addition, our very copy of the Spec is going to require new types which don't exist in the Spark Spec which necessarily means diverging.
>>>>> *We will need to take up new primitive id's (as noted in my first email)
>>>>>
>>>>> The other issue I have is I don't think the Spark Spec is really going through a thorough review process from all members of the Spark community; I believe it probably should have gone through the SPIP but instead seems to have been merged without broad community involvement.
>>>>>
>>>>> The only way to truly avoid diverging is to only have a single copy of the spec; in our previous discussions the vast majority of the Apache Iceberg community wanted it to exist here.
>>>>>
>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>
>>>>>> I'm really excited about the introduction of variant type to Iceberg, but I want to raise concerns about forking the spec.
>>>>>>
>>>>>> I feel like preemptively forking would create the situation where we end up diverging because there's little reason to work with both communities to evolve in a way that benefits everyone.
>>>>>>
>>>>>> I would much rather point to a specific version of the spec and annotate any variance in Iceberg's handling. This would allow us to continue without dividing the communities.
>>>>>>
>>>>>> If at any point there are irreconcilable differences, I would support forking, but I don't feel like that should be the initial step.
>>>>>>
>>>>>> No one is excited about the possibility that the physical representations end up diverging, but it feels like we're setting ourselves up for that exact scenario.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>
>>>>>>> +1 to what's already being said here. It is good to copy the spec to Iceberg and add context that's specific to Iceberg, but at the same time, we should maintain compatibility.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 to copy the spec into our repository. I think the best way to keep compatibility is building integration tests.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Manu
>>>>>>>>
>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>>>>>
>>>>>>>>> Given the differences between the supported types and the lack of interest from the other project, I think it is reasonable to duplicate the specification to our repository.
>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as much as possible, to keep compatibility as much as possible. Maybe even revert to a shared specification if the situation changes.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <aihu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Russell for bringing this up.
>>>>>>>>>>
>>>>>>>>>> This is the main blocker to moving forward with Variant support in Iceberg and hopefully we can reach a consensus. To me, it also makes more sense to move the spec into Iceberg rather than have the Spark engine own it while we try to keep it compatible with the Spark spec.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Aihua
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Y’all,
>>>>>>>>>>>
>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal. While we were hoping to move the Variant and Shredding specifications from Spark into Iceberg, there doesn’t seem to be a lot of interest in that.
>>>>>>>>>>> Unfortunately, I think we have a number of issues with just linking to the Spark project directly from within Iceberg and *I believe we need to copy the specifications into our repository*.
>>>>>>>>>>>
>>>>>>>>>>> There are a few reasons why I think this is necessary.
>>>>>>>>>>>
>>>>>>>>>>> First, we have a divergence of types already. The Spark Specification already includes types which Iceberg has no definition for (19, 20 <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types> - Interval Types) and Iceberg already has a type which is not included within the Spark Specification (Time) and will soon have more with TimestampNS and Geo.
>>>>>>>>>>>
>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard dependency for other engines. We are working with several implementers of the Iceberg spec and it has previously been agreed that it would be best if the source of truth for Variant existed in an engine and file format neutral location. The Iceberg project has a good open model of governance and, as we have seen so far discussing Variant <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>, open and active collaboration. This would also help as we can strictly version our changes in-line with the rest of the Iceberg spec.
>>>>>>>>>>>
>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires some group analysis and discussion before we commit it. I think again the Iceberg community is probably the right place for this to happen as we have already started discussions here on these topics.
>>>>>>>>>>>
>>>>>>>>>>> For these reasons I think we should go with a direct copy of the existing specification from the Spark Project and move ahead with our discussions and modifications within Iceberg. That said, *I do not want to diverge if possible from the Spark proposal*. For example, although we do not use the Interval types above, I think we should not reuse those type ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain unused along with any other types we think are not applicable. We should strive whenever possible to allow for compatibility.
>>>>>>>>>>>
>>>>>>>>>>> In the interest of moving forward with this proposal I am hoping to see if anyone in the community objects to this plan going forward or has a better alternative.
>>>>>>>>>>>
>>>>>>>>>>> As always I am thankful for your time and am eager to hear back from everyone,
>>>>>>>>>>> Russ