Re: [DISCUSS] Variant Spec Location

Gang Wu Thu, 22 Aug 2024 01:45:47 -0700

Sorry for the inconvenience.

This is the permalink for the discussion:
https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw


On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <[email protected]> wrote:

>
> Hi Gang,
>
> Sorry, but can you give a pointer to the start of this discussion thread
> in a readable format (for example a mailing-list archive)? It appears
> that dev@arrow wasn't cc'ed from the start and that can make it
> difficult to understand what this is about.
>
> Regards
>
> Antoine.
>
>
> On 22/08/2024 at 08:32, Gang Wu wrote:
> > It seems that we have reached a consensus to some extent that there
> > should be a new home for the variant spec. The pending question
> > is whether Parquet or Arrow is a better choice. As a committer in the Arrow,
> > Parquet and ORC communities, I am neutral on the choice and happy to
> > help with the move once a decision has been made.
> >
> > Should we start a vote to move forward?
> >
> > Best,
> > Gang
> >
> > On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> >>>
> >>> That being said, I think the most important consideration for now is where
> >>> are the current maintainers / contributors to the variant type. If most of
> >>> them are already PMC members / committers on a project, it becomes a bit
> >>> easier. Otherwise if there isn't much overlap with a project's existing
> >>> governance, I worry there could be a bit of friction. How many active
> >>> contributors are there from Iceberg? And how about from Arrow?
> >>
> >>
> >> I think this is the key question. What are the requirements around
> >> governance?  I've seen some tangential messaging here but I'm not clear on
> >> what everyone expects.
> >>
> >> I think for a lot of the other concerns my view is that the exact project
> >> does not really matter (and choosing a project with mature cross-language
> >> testing infrastructure or committing to building it is critical). IIUC we
> >> are talking about the following artifacts:
> >>
> >> 1.  A standalone specification document (this can be hosted anyplace)
> >> 2.  A set of language bindings with minimal dependencies that can be consumed
> >> downstream (again, as long as dependencies are managed carefully any
> >> project can host these)
> >> 3.  Potential integration where appropriate into file format libraries to
> >> support shredding (but as of now this is being bypassed by using
> >> conventions anyways).  My impression is that at least for Parquet there has
> >> been a proliferation of vectorized readers across different projects, so
> >> I'm not clear how much standardization in parquet-java could help here.
> >>
> >> To respond to some other questions:
> >>
> >>> Arrow is not used as Spark's in-memory model, nor Trino and others so those
> >>> existing relationships aren't there. I also worry that differences in
> >>> approaches would make it difficult later on.
> >>
> >>
> >> While Arrow is not in the core memory model, for Spark I believe it is
> >> still used for IPC for things like Java<->Python. Trino also consumes Arrow
> >> libraries today to support things like Snowflake/BigQuery federation. But I
> >> think this is minor because as mentioned above I think the functional
> >> libraries would be relatively stand-alone.
> >>
> >> Do we think it could be introduced as a canonical extension arrow type?
> >>
> >>
> >> I believe it can be. I think there are probably different layouts that can
> >> be supported:
> >>
> >> 1.  A struct with two variable-width bytes columns (metadata and value data
> >> are stored separately and each entry has a 1:1 relationship).
> >> 2.  Shredded (shredded according to the same convention as Parquet). I
> >> would need to double check, but I don't think Arrow would have problems here,
> >> though REE would likely be required to make this efficient (i.e. sparse value
> >> support is important).
> >>
> >> In both cases the main complexity is providing the necessary functions for
> >> manipulation.
> >>
> >> Thanks,
> >> Micah
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <[email protected]>
> >> wrote:
> >>
> >>> In being more engine and format agnostic, I agree the Arrow project might
> >>> be a good host for such a specification. It seems like we want to move away
> >>> from hosting in Spark to make it engine agnostic. But moving into Iceberg
> >>> might make it less format agnostic, as I understand multiple formats might
> >>> want to implement this. I'm not intimately familiar with the state of this,
> >>> but I believe Delta Lake would like to be aligned with the same format as
> >>> Iceberg. In addition, the Lance format (which I work on), will eventually
> >>> be interesting as well. It seems equally bad to me to attach this
> >>> specification to a particular table format as it does a particular query
> >>> engine.
> >>>
> >>> That being said, I think the most important consideration for now is where
> >>> are the current maintainers / contributors to the variant type. If most of
> >>> them are already PMC members / committers on a project, it becomes a bit
> >>> easier. Otherwise if there isn't much overlap with a project's existing
> >>> governance, I worry there could be a bit of friction. How many active
> >>> contributors are there from Iceberg? And how about from Arrow?
> >>>
> >>> BTW, I'd add I'm interested in helping develop an Arrow extension type for
> >>> the binary variant type. I've been experimenting with a DataFusion
> >>> extension that operates on this [1], and already have some ideas on how
> >>> such an extension type might be defined. I'm not yet caught up on the
> >>> shredded specification, but I think having just the binary format would be
> >>> beneficial for in-memory analytics, which are most relevant to Arrow. I'll
> >>> be creating a separate thread on the Arrow ML about this soon.
> >>>
> >>> Best,
> >>>
> >>> Will Jones
> >>>
> >>> [1] https://github.com/datafusion-contrib/datafusion-functions-variant/issues
> >>>
> >>>
> >>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <[email protected]> wrote:
> >>>
> >>>> + dev@arrow
> >>>>
> >>>> Thanks for all the valuable suggestions! I am inclined to Micah's idea that
> >>>> Arrow might be a better host compared to Parquet.
> >>>>
> >>>> To give more context, I am taking the initiative to add the geometry type
> >>>> to both Parquet and ORC. I'd like to do the same thing for variant type in
> >>>> that variant type is engine and file format agnostic. This does mean that
> >>>> Parquet might not be the neutral place to hold the variant spec.
> >>>>
> >>>> Best,
> >>>> Gang
> >>>>
> >>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Thanks all for your discussion.
> >>>>>
> >>>>> The Apache Paimon community is also considering support for this
> >>>>> Variant type. Without a doubt, we hope to maintain consistency with
> >>>>> Iceberg.
> >>>>>
> >>>>> Not only the Paimon community, but also various computing engines need
> >>>>> to adapt to this type, such as Flink and StarRocks. We also hope to
> >>>>> encourage them to adapt to this type.
> >>>>>
> >>>>> It is worth noting that we also need to standardize many functions
> >>>>> related to it.
> >>>>>
> >>>>> A neutral place to maintain it is a great choice.
> >>>>>
> >>>>> - As Gang Wu said, a standalone project is good, just like RoaringBitmap [1].
> >>>>> - As Ryan said, the Parquet community is a neutral option too.
> >>>>> - As Micah said, Arrow is an option too.
> >>>>>
> >>>>> [1] https://github.com/RoaringBitmap
> >>>>>
> >>>>> Best,
> >>>>> Jingsong
> >>>>>
> >>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <[email protected]> wrote:
> >>>>>>>
> >>>>>>> That's fair @Micah, so far all the discussions have been direct and off
> >>>>>>> the dev list. Would you like to make the request on the public Spark Dev
> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick email if you
> >>>>>>> don't have time.
> >>>>>>
> >>>>>>
> >>>>>> I think once we come to consensus, if you have bandwidth, I think the
> >>>>>> message might be better coming from you, as you have more context on some
> >>>>>> of the non-public conversations, the requirements from an Iceberg
> >>>>>> perspective on governance and the blockers that were encountered.  If
> >>>>>> details on the conversations can't be shared (i.e. we are starting from
> >>>>>> scratch), it seems like suggesting a new project via SPIP might be the way
> >>>>>> forward.  I'm happy to help with that if it is useful but I would guess
> >>>>>> Aihua or Tyler might be in a better place to start as it seems they have
> >>>>>> done more serious thinking here.
> >>>>>>
> >>>>>> If we decide to try to standardize on Parquet or Arrow I'm happy to help
> >>>>>> support the effort in those communities.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Micah
> >>>>>>
> >>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <[email protected]> wrote:
> >>>>>>>
> >>>>>>> That's fair @Micah, so far all the discussions have been direct and off
> >>>>>>> the dev list. Would you like to make the request on the public Spark Dev
> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick email if you
> >>>>>>> don't have time.
> >>>>>>>
> >>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> I agree that it would be beneficial to make a sub-project, the main
> >>>>>>>>> problem is political and not logistic. I've been asking for movement from
> >>>>>>>>> other relevant projects for a month and we simply haven't gotten anywhere.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I just wanted to double check that these issues were brought directly
> >>>>>>>> to the Spark community (i.e. a discussion thread on the Spark developer
> >>>>>>>> mailing list) and not via backchannels.
> >>>>>>>>
> >>>>>>>> I'm not sure the outcome would be different and I don't think this
> >>>>>>>> should block forking the spec, but we should make sure that the decision is
> >>>>>>>> publicly documented within both communities.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Micah
> >>>>>>>>
> >>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> @Gang Wu
> >>>>>>>>>
> >>>>>>>>> I agree that it would be beneficial to make a sub-project, the main
> >>>>>>>>> problem is political and not logistic. I've been asking for movement from
> >>>>>>>>> other relevant projects for a month and we simply haven't gotten anywhere.
> >>>>>>>>> I don't think there is anything that would stop us from moving to a joint
> >>>>>>>>> project in the future and if you know of some way of encouraging that
> >>>>>>>>> movement from other relevant parties I would be glad to collaborate in
> >>>>>>>>> doing that. One thing that I don't want to do is have the Iceberg project
> >>>>>>>>> stay in a holding pattern without any clear roadmap as to how to proceed.
> >>>>>>>>>
> >>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I’m on board with copying the spec into our repository. However, as
> >>>>>>>>>> we’ve talked about, it’s not just a straightforward copy: there are already
> >>>>>>>>>> some divergences. Some of them are under discussion. Iceberg is definitely
> >>>>>>>>>> the best place for these specs. Engines like Trino and Flink can then rely
> >>>>>>>>>> on the Iceberg specs as a solid foundation.
> >>>>>>>>>>
> >>>>>>>>>> Yufei
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry for chiming in late.
> >>>>>>>>>>>
> >>>>>>>>>>> From the discussion in
> >>>>>>>>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't
> >>>>>>>>>>> quite understand why it is logistically complicated to create a sub-project
> >>>>>>>>>>> to hold the variant spec and impl.
> >>>>>>>>>>>
> >>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has some
> >>>>>>>>>>> deficiencies:
> >>>>>>>>>>> - It is a burden to update two repos if there is a variant type
> >>>>>>>>>>> spec change and will likely result in deviation if some changes do not
> >>>>>>>>>>> reach agreement from both parties.
> >>>>>>>>>>> - Implementers are required to keep an eye on both specs
> >>>>>>>>>>> (considering proprietary engines where both Iceberg and Delta are
> >>>>>>>>>>> supported).
> >>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg repo does
> >>>>>>>>>>> lose the opportunity for better native support from file formats like
> >>>>>>>>>>> Parquet and ORC.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not sure if it is possible to create a separate project (e.g.
> >>>>>>>>>>> apache/variant-type) to make it a single point of truth. We can learn from
> >>>>>>>>>>> the experience of Apache Arrow. In this fashion, different engines, table
> >>>>>>>>>>> formats and file formats can follow the same spec and are free to depend on
> >>>>>>>>>>> the reference implementations from apache/variant-type or implement their
> >>>>>>>>>>> own.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Gang
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 for copying the spec into our repository, I think we need to
> >>>>>>>>>>>> own it fully as a part of the table spec, and we can build compatibility
> >>>>>>>>>>>> through tests.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Jack
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm not really in favor of linking and annotating as that just
> >>>>>>>>>>>>> makes things more complicated and still is essentially forking just with
> >>>>>>>>>>>>> more steps. If we just track our annotations / modifications to a single
> >>>>>>>>>>>>> commit/version then we have the same issue again but now you have to go to
> >>>>>>>>>>>>> multiple sources to get the actual Spec. In addition, our very copy of the
> >>>>>>>>>>>>> Spec is going to require new types which don't exist in the Spark Spec
> >>>>>>>>>>>>> which necessarily means diverging. We will need to take up new primitive
> >>>>>>>>>>>>> IDs (as noted in my first email).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is really
> >>>>>>>>>>>>> going through a thorough review process from all members of the Spark
> >>>>>>>>>>>>> community, I believe it probably should have gone through the SPIP but
> >>>>>>>>>>>>> instead seems to have been merged without broad community involvement.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The only way to truly avoid diverging is to only have a single
> >>>>>>>>>>>>> copy of the spec, in our previous discussions the vast majority of Apache
> >>>>>>>>>>>>> Iceberg community want it to exist here.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm really excited about the introduction of variant type to
> >>>>>>>>>>>>>> Iceberg, but I want to raise concerns about forking the spec.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I feel like preemptively forking would create the situation
> >>>>>>>>>>>>>> where we end up diverging because there's little reason to work with both
> >>>>>>>>>>>>>> communities to evolve in a way that benefits everyone.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I would much rather point to a specific version of the spec and
> >>>>>>>>>>>>>> annotate any variance in Iceberg's handling.  This would allow us to
> >>>>>>>>>>>>>> continue without dividing the communities.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If at any point there are irreconcilable differences, I would
> >>>>>>>>>>>>>> support forking, but I don't feel like that should be the initial step.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> No one is excited about the possibility that the physical
> >>>>>>>>>>>>>> representations end up diverging, but it feels like we're setting ourselves
> >>>>>>>>>>>>>> up for that exact scenario.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Dan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <[email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy the
> >>>>>>>>>>>>>>> spec to Iceberg and add context that's specific to Iceberg, but at the same
> >>>>>>>>>>>>>>> time, we should maintain compatibility.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>>>>> Fokko
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <[email protected]> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the best way
> >>>>>>>>>>>>>>>> to keep compatibility is building integration tests.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Manu
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Given the differences between the supported types and the
> >>>>>>>>>>>>>>>>> lack of interest from the other project, I think it is reasonable to
> >>>>>>>>>>>>>>>>> duplicate the specification to our repository.
> >>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark
> >>>>>>>>>>>>>>>>> spec as much as possible, to keep compatibility as much as possible. Maybe
> >>>>>>>>>>>>>>>>> even revert to a shared specification if the situation changes.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This is the main blocker to moving forward with the Variant
> >>>>>>>>>>>>>>>>>> support in Iceberg and hopefully we can reach a consensus. To me, it
> >>>>>>>>>>>>>>>>>> also makes more sense to move the spec into Iceberg than to have the Spark
> >>>>>>>>>>>>>>>>>> engine own it while we try to keep it compatible with the Spark spec.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>> Aihua
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <[email protected]> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Y’all,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal.
> >>>>>>>>>>>>>>>>>>> While we were hoping to move the Variant and Shredding specifications from
> >>>>>>>>>>>>>>>>>>> Spark into Iceberg, there doesn’t seem to be a lot of interest in that.
> >>>>>>>>>>>>>>>>>>> Unfortunately, I think we have a number of issues with just linking to the
> >>>>>>>>>>>>>>>>>>> Spark project directly from within Iceberg and I believe we need to copy
> >>>>>>>>>>>>>>>>>>> the specifications into our repository.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The Spark
> >>>>>>>>>>>>>>>>>>> Specification already includes types which Iceberg has no definition for
> >>>>>>>>>>>>>>>>>>> (19, 20 - Interval Types) and Iceberg already has a type which is not
> >>>>>>>>>>>>>>>>>>> included within the Spark Specification (Time) and will soon have more with
> >>>>>>>>>>>>>>>>>>> TimestampNS and Geo.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard
> >>>>>>>>>>>>>>>>>>> dependency for other engines. We are working with several implementers of
> >>>>>>>>>>>>>>>>>>> the Iceberg spec and it has previously been agreed that it would be best if
> >>>>>>>>>>>>>>>>>>> the source of truth for Variant existed in an engine and file format
> >>>>>>>>>>>>>>>>>>> neutral location. The Iceberg project has a good open model of governance
> >>>>>>>>>>>>>>>>>>> and, as we have seen so far discussing Variant, open and active
> >>>>>>>>>>>>>>>>>>> collaboration. This would also help as we can strictly version our changes
> >>>>>>>>>>>>>>>>>>> in-line with the rest of the Iceberg spec.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and
> >>>>>>>>>>>>>>>>>>> requires some group analysis and discussion before we commit it. I think
> >>>>>>>>>>>>>>>>>>> again the Iceberg community is probably the right place for this to happen
> >>>>>>>>>>>>>>>>>>> as we have already started discussions here on these topics.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy
> >>>>>>>>>>>>>>>>>>> of the existing specification from the Spark Project and move ahead with
> >>>>>>>>>>>>>>>>>>> our discussions and modifications within Iceberg. That said, I do not want
> >>>>>>>>>>>>>>>>>>> to diverge if possible from the Spark proposal. For example, although we do
> >>>>>>>>>>>>>>>>>>> not use the Interval types above, I think we should not reuse those type
> >>>>>>>>>>>>>>>>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain
> >>>>>>>>>>>>>>>>>>> unused along with any other types we think are not applicable. We should
> >>>>>>>>>>>>>>>>>>> strive whenever possible to allow for compatibility.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am
> >>>>>>>>>>>>>>>>>>> hoping to see if anyone in the community objects to this plan going forward
> >>>>>>>>>>>>>>>>>>> or has a better alternative.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear
> >>>>>>>>>>>>>>>>>>> back from everyone,
> >>>>>>>>>>>>>>>>>>> Russ
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
