Sorry for the inconvenience. This is the permalink for the discussion: https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi Gang,
>
> Sorry, but can you give a pointer to the start of this discussion thread in a readable format (for example a mailing-list archive)? It appears that dev@arrow wasn't cc'ed from the start and that can make it difficult to understand what this is about.
>
> Regards
>
> Antoine.
>
> On 22/08/2024 at 08:32, Gang Wu wrote:
> > It seems that we have reached a consensus to some extent that there should be a new home for the variant spec. The pending question is whether Parquet or Arrow is a better choice. As a committer from Arrow, Parquet and ORC communities, I am neutral to choose any and happy to help with the movement once a decision has been made.
> >
> > Should we start a vote to move forward?
> >
> > Best,
> > Gang
> >
> > On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> >>> That being said, I think the most important consideration for now is where are the current maintainers / contributors to the variant type. If most of them are already PMC members / committers on a project, it becomes a bit easier. Otherwise if there isn't much overlap with a project's existing governance, I worry there could be a bit of friction. How many active contributors are there from Iceberg? And how about from Arrow?
> >>
> >> I think this is the key question. What are the requirements around governance? I've seen some tangential messaging here but I'm not clear on what everyone expects.
> >>
> >> I think for a lot of the other concerns my view is that the exact project does not really matter (and choosing a project with mature cross language testing infrastructure or committing to building it is critical). IIUC we are talking about the following artifacts:
> >>
> >> 1. A stand alone specification document (this can be hosted anyplace)
> >> 2. A set of language bindings with minimal dependencies that can be consumed downstream (again, as long as dependencies are managed carefully any project can host these)
> >> 3. Potential integration where appropriate into file format libraries to support shredding (but as of now this is being bypassed by using conventions anyways). My impression is that at least for Parquet there has been a proliferation of vectorized readers across different projects, so I'm not clear how much standardization in parquet-java could help here.
> >>
> >> To respond to some other questions:
> >>
> >>> Arrow is not used as Spark's in-memory model, nor Trino and others so those existing relationships aren't there. I also worry that differences in approaches would make it difficult later on.
> >>
> >> While Arrow is not in the core memory model, for Spark I believe it is still used for IPC for things like Java<->Python. Trino also consumes Arrow libraries today to support things like Snowflake/Bigquery federation. But I think this is minor because as mentioned above I think the functional libraries would be relatively stand-alone.
> >>
> >>> Do we think it could be introduced as a canonical extension arrow type?
> >>
> >> I believe it can be, I think there are probably different layouts that can be supported:
> >>
> >> 1. A struct with two variable width bytes columns (metadata and value data are stored separately and each entry has a 1:1 relationship).
> >> 2. Shredded (shredded according to the same convention as parquet), I would need to double check but I don't think Arrow would have problems here but REE would likely be required to make this efficient (i.e. sparse value support is important).
> >>
> >> In both cases the main complexity is providing the necessary functions for manipulation.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <will.jones...@gmail.com> wrote:
> >>
> >>> In being more engine and format agnostic, I agree the Arrow project might be a good host for such a specification. It seems like we want to move away from hosting in Spark to make it engine agnostic. But moving into Iceberg might make it less format agnostic, as I understand multiple formats might want to implement this. I'm not intimately familiar with the state of this, but I believe Delta Lake would like to be aligned with the same format as Iceberg. In addition, the Lance format (which I work on), will eventually be interesting as well. It seems equally bad to me to attach this specification to a particular table format as it does a particular query engine.
> >>>
> >>> That being said, I think the most important consideration for now is where are the current maintainers / contributors to the variant type. If most of them are already PMC members / committers on a project, it becomes a bit easier. Otherwise if there isn't much overlap with a project's existing governance, I worry there could be a bit of friction. How many active contributors are there from Iceberg? And how about from Arrow?
> >>>
> >>> BTW, I'd add I'm interested in helping develop an Arrow extension type for the binary variant type. I've been experimenting with a DataFusion extension that operates on this [1], and already have some ideas on how such an extension type might be defined. I'm not yet caught up on the shredded specification, but I think having just the binary format would be beneficial for in-memory analytics, which are most relevant to Arrow. I'll be creating a separate thread on the Arrow ML about this soon.
> >>>
> >>> Best,
> >>>
> >>> Will Jones
> >>>
> >>> [1] https://github.com/datafusion-contrib/datafusion-functions-variant/issues
> >>>
> >>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com> wrote:
> >>>
> >>>> + dev@arrow
> >>>>
> >>>> Thanks for all the valuable suggestions! I am inclined to Micah's idea that Arrow might be a better host compared to Parquet.
> >>>>
> >>>> To give more context, I am taking the initiative to add the geometry type to both Parquet and ORC. I'd like to do the same thing for variant type in that variant type is engine and file format agnostic. This does mean that Parquet might not be the neutral place to hold the variant spec.
> >>>>
> >>>> Best,
> >>>> Gang
> >>>>
> >>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <jingsongl...@gmail.com> wrote:
> >>>>
> >>>>> Thanks all for your discussion.
> >>>>>
> >>>>> The Apache Paimon community is also considering support for this Variant type, without a doubt, we hope to maintain consistency with Iceberg.
> >>>>>
> >>>>> Not only the Paimon community, but also various computing engines need to adapt to this type, such as Flink and StarRocks. We also hope to promote them to adapt to this type.
> >>>>>
> >>>>> It is worth noting that we also need to standardize many functions related to it.
> >>>>>
> >>>>> A neutral place to maintain it is a great choice.
> >>>>>
> >>>>> - As Gang Wu said, a standalone project is good, just like RoaringBitmap [1].
> >>>>> - As Ryan said, Parquet community is a neutral option too.
> >>>>> - As Micah said, Arrow is also an option too.
> >>>>>
> >>>>> [1] https://github.com/RoaringBitmap
> >>>>>
> >>>>> Best,
> >>>>> Jingsong
> >>>>>
> >>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> That's fair @Micah, so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign, I can also draft up a quick email if you don't have time.
> >>>>>>
> >>>>>> I think once we come to consensus, if you have bandwidth, I think the message might be better coming from you, as you have more context on some of the non-public conversations, the requirements from an Iceberg perspective on governance and the blockers that were encountered. If details on the conversations can't be shared, (i.e. we are starting from scratch) it seems like suggesting a new project via SPIP might be the way forward. I'm happy to help with that if it is useful but I would guess Aihua or Tyler might be in a better place to start as it seems they have done more serious thinking here.
> >>>>>>
> >>>>>> If we decide to try to standardize on Parquet or Arrow I'm happy to help support the effort in those communities.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Micah
> >>>>>>
> >>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> That's fair @Micah, so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign, I can also draft up a quick email if you don't have time.
> >>>>>>>
> >>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relative projects for a month and we simply haven't gotten anywhere.
> >>>>>>>>
> >>>>>>>> I just wanted to double check that these issues were brought directly to the spark community (i.e. a discussion thread on the Spark developer mailing list) and not via backchannels.
> >>>>>>>>
> >>>>>>>> I'm not sure the outcome would be different and I don't think this should block forking the spec, but we should make sure that the decision is publicly documented within both communities.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Micah
> >>>>>>>>
> >>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> @Gang Wu
> >>>>>>>>>
> >>>>>>>>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relative projects for a month and we simply haven't gotten anywhere. I don't think there is anything that would stop us from moving to a joint project in the future and if you know of some way of encouraging that movement from other relevant parties I would be glad to collaborate in doing that. One thing that I don't want to do is have the Iceberg project stay in a holding pattern without any clear roadmap as to how to proceed.
> >>>>>>>>>
> >>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I’m on board with copying the spec into our repository. However, as we’ve talked about, it’s not just a straightforward copy—there are already some divergences. Some of them are under discussion. Iceberg is definitely the best place for these specs. Engines like Trino and Flink can then rely on the Iceberg specs as a solid foundation.
> >>>>>>>>>>
> >>>>>>>>>> Yufei
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry for chiming in late.
> >>>>>>>>>>>
> >>>>>>>>>>> From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl.
> >>>>>>>>>>>
> >>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has some deficiencies:
> >>>>>>>>>>> - It is a burden to update two repos if there is a variant type spec change and will likely result in deviation if some changes do not reach agreement from both parties.
> >>>>>>>>>>> - Implementers are required to keep an eye on both specs (considering proprietary engines where both Iceberg and Delta are supported).
> >>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg repo does lose the opportunity for better native support from file formats like Parquet and ORC.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not sure if it is possible to create a separate project (e.g. apache/variant-type) to make it a single point of truth. We can learn from the experience of Apache Arrow. In this fashion, different engines, table formats and file formats can follow the same spec and are free to depend on the reference implementations from apache/variant-type or implement their own.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Gang
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 for copying the spec into our repository, I think we need to own it fully as a part of the table spec, and we can build compatibility through tests.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Jack
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm not really in favor of linking and annotating as that just makes things more complicated and still is essentially forking just with more steps. If we just track our annotations / modifications to a single commit/version then we have the same issue again but now you have to go to multiple sources to get the actual Spec. In addition, our very copy of the Spec is going to require new types which don't exist in the Spark Spec which necessarily means diverging. We will need to take up new primitive id's (as noted in my first email)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is really going through a thorough review process from all members of the Spark community, I believe it probably should have gone through the SPIP but instead seems to have been merged without broad community involvement.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The only way to truly avoid diverging is to only have a single copy of the spec, in our previous discussions the vast majority of Apache Iceberg community want it to exist here.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm really excited about the introduction of variant type to Iceberg, but I want to raise concerns about forking the spec.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I feel like preemptively forking would create the situation where we end up diverging because there's little reason to work with both communities to evolve in a way that benefits everyone.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I would much rather point to a specific version of the spec and annotate any variance in Iceberg's handling. This would allow us to continue without dividing the communities.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If at any point there are irreconcilable differences, I would support forking, but I don't feel like that should be the initial step.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> No one is excited about the possibility that the physical representations end up diverging, but it feels like we're setting ourselves up for that exact scenario.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Dan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy the spec to Iceberg and add context that's specific to Iceberg, but at the same time, we should maintain compatibility.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>> Fokko
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the best way to keep compatibility is building integration tests.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Manu
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Given the differences between the supported types and the lack of interest from the other project, I think it is reasonable to duplicate the specification to our repository.
> >>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as much as possible, to keep compatibility as much as possible. Maybe even revert to a shared specification if the situation changes.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Aihua Xu <aihu...@gmail.com> wrote (on Tue, Aug 13, 2024, 19:52):
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the Variant support in Iceberg and hopefully we can have a consensus. To me, it also makes more sense to move the spec into Iceberg rather than having the Spark engine own it while we try to keep it compatible with the Spark spec.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>> Aihua
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Y’all,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal, while we were hoping to move the Variant and Shredding specifications from Spark into Iceberg there doesn’t seem to be a lot of interest in that. Unfortunately, I think we have a number of issues with just linking to the Spark project directly from within Iceberg and I believe we need to copy the specifications into our repository.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The Spark Specification already includes types which Iceberg has no definition for (19, 20 - Interval Types) and Iceberg already has a type which is not included within the Spark Specification (Time) and will soon have more with TimestampNS, and Geo.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard dependency for other engines. We are working with several implementers of the Iceberg spec and it has previously been agreed that it would be best if the source of truth for Variant existed in an engine and file format neutral location. The Iceberg project has a good open model of governance and, as we have seen so far discussing Variant, open and active collaboration. This would also help as we can strictly version our changes in-line with the rest of the Iceberg spec.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires some group analysis and discussion before we commit it. I think again the Iceberg community is probably the right place for this to happen as we have already started discussions here on these topics.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy of the existing specification from the Spark Project and move ahead with our discussions and modifications within Iceberg. That said, I do not want to diverge if possible from the Spark proposal. For example, although we do not use the Interval types above, I think we should not reuse those type ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain unused along with any other types we think are not applicable. We should strive whenever possible to allow for compatibility.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am hoping to see if anyone in the community objects to this plan going forward or has a better alternative.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear back from everyone,
> >>>>>>>>>>>>>>>>>>> Russ