@Gang Wu I agree that a sub-project would be beneficial; the main problem is political, not logistical. I've been asking for movement from the other relevant projects for a month and we simply haven't gotten anywhere. I don't think anything would stop us from moving to a joint project in the future, and if you know of some way to encourage that movement from the other relevant parties, I would be glad to collaborate on it. The one thing I don't want is for the Iceberg project to stay in a holding pattern without a clear roadmap for how to proceed.
On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
> I’m on board with copying the spec into our repository. However, as we’ve talked about, it’s not just a straightforward copy: there are already some divergences, and some of them are under discussion. Iceberg is definitely the best place for these specs. Engines like Trino and Flink can then rely on the Iceberg specs as a solid foundation.
>
> Yufei
>
> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>
>> Sorry for chiming in late.
>>
>> From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl.
>>
>> IMHO, copying the variant type spec into Apache Iceberg has some deficiencies:
>> - It is a burden to update two repos if there is a variant type spec change, and it will likely result in deviation if some changes do not reach agreement from both parties.
>> - Implementers are required to keep an eye on both specs (considering proprietary engines where both Iceberg and Delta are supported).
>> - Putting the spec and impl of variant type in the Iceberg repo loses the opportunity for better native support from file formats like Parquet and ORC.
>>
>> I'm not sure if it is possible to create a separate project (e.g. apache/variant-type) to make it a single point of truth. We can learn from the experience of Apache Arrow. In this fashion, different engines, table formats and file formats can follow the same spec and are free to depend on the reference implementations from apache/variant-type or implement their own.
>>
>> Best,
>> Gang
>>
>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> +1 for copying the spec into our repository. I think we need to own it fully as a part of the table spec, and we can build compatibility through tests.
>>>
>>> -Jack
>>>
>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> I'm not really in favor of linking and annotating as that just makes things more complicated and still is essentially forking just with more steps. If we just track our annotations / modifications to a single commit/version then we have the same issue again but now you have to go to multiple sources to get the actual Spec. *In addition, our very copy of the Spec is going to require new types which don't exist in the Spark Spec which necessarily means diverging.* We will need to take up new primitive id's (as noted in my first email)
>>>>
>>>> The other issue I have is I don't think the Spark Spec is really going through a thorough review process from all members of the Spark community, I believe it probably should have gone through the SPIP but instead seems to have been merged without broad community involvement.
>>>>
>>>> The only way to truly avoid diverging is to only have a single copy of the spec, in our previous discussions the vast majority of Apache Iceberg community want it to exist here.
>>>>
>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>
>>>>> I'm really excited about the introduction of variant type to Iceberg, but I want to raise concerns about forking the spec.
>>>>>
>>>>> I feel like preemptively forking would create the situation where we end up diverging because there's little reason to work with both communities to evolve in a way that benefits everyone.
>>>>>
>>>>> I would much rather point to a specific version of the spec and annotate any variance in Iceberg's handling. This would allow us to continue without dividing the communities.
>>>>>
>>>>> If at any point there are irreconcilable differences, I would support forking, but I don't feel like that should be the initial step.
>>>>>
>>>>> No one is excited about the possibility that the physical representations end up diverging, but it feels like we're setting ourselves up for that exact scenario.
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>
>>>>>> +1 to what's already being said here. It is good to copy the spec to Iceberg and add context that's specific to Iceberg, but at the same time, we should maintain compatibility.
>>>>>>
>>>>>> Kind regards,
>>>>>> Fokko
>>>>>>
>>>>>> On Wed, Aug 14, 2024 at 3:30 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 to copy the spec into our repository. I think the best way to keep compatibility is building integration tests.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Manu
>>>>>>>
>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>>>>
>>>>>>>> Given the differences between the supported types and the lack of interest from the other project, I think it is reasonable to duplicate the specification to our repository.
>>>>>>>> I would put very strong emphasis on sticking to the Spark spec as much as possible, to keep compatibility as much as possible. Maybe even revert to a shared specification if the situation changes.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> On Tue, Aug 13, 2024 at 7:52 PM Aihua Xu <aihu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Russell for bringing this up.
>>>>>>>>>
>>>>>>>>> This is the main blocker to moving forward with Variant support in Iceberg, and hopefully we can reach a consensus. To me, it also makes more sense to move the spec into Iceberg than to have the Spark engine own it while we try to keep it compatible with the Spark spec.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Aihua
>>>>>>>>>
>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Y’all,
>>>>>>>>>>
>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal. While we were hoping to move the Variant and Shredding specifications from Spark into Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately, I think we have a number of issues with just linking to the Spark project directly from within Iceberg and *I believe we need to copy the specifications into our repository*.
>>>>>>>>>>
>>>>>>>>>> There are a few reasons why I think this is necessary.
>>>>>>>>>>
>>>>>>>>>> First, we have a divergence of types already. The Spark Specification already includes types which Iceberg has no definition for (19, 20 <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types> - Interval Types) and Iceberg already has a type which is not included within the Spark Specification (Time) and will soon have more with TimestampNS, and Geo.
>>>>>>>>>>
>>>>>>>>>> Second, we would like to make sure that Spark is not a hard dependency for other engines. We are working with several implementers of the Iceberg spec and it has previously been agreed that it would be best if the source of truth for Variant existed in an engine and file format neutral location. The Iceberg project has a good open model of governance and, as we have seen so far discussing Variant <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>, open and active collaboration.
>>>>>>>>>> This would also help as we can strictly version our changes in-line with the rest of the Iceberg spec.
>>>>>>>>>>
>>>>>>>>>> Third, The Shredding spec is not quite finished and requires some group analysis and discussion before we commit it. I think again the Iceberg community is probably the right place for this to happen as we have already started discussions here on these topics.
>>>>>>>>>>
>>>>>>>>>> For these reasons I think we should go with a direct copy of the existing specification from the Spark Project and move ahead with our discussions and modifications within Iceberg. That said, *I do not want to diverge if possible from the Spark proposal*. For example, although we do not use the Interval types above, I think we should not reuse those type ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain unused along with any other types we think are not applicable. We should strive whenever possible to allow for compatibility.
>>>>>>>>>>
>>>>>>>>>> In the interest of moving forward with this proposal I am hoping to see if anyone in the community objects to this plan going forward or has a better alternative.
>>>>>>>>>>
>>>>>>>>>> As always I am thankful for your time and am eager to hear back from everyone,
>>>>>>>>>> Russ