Update: The Spark community has successfully closed the vote [1]
to move the Variant spec to Parquet. I will start a formal vote on the Parquet
side, though we have already received sufficient binding votes in this
thread.

[1] https://lists.apache.org/thread/pkybo148j6qyn2wsjnmyrhqs3crn9b89

Best,
Gang

On Wed, Aug 28, 2024 at 4:19 PM Antoine Pitrou <anto...@python.org> wrote:

>
> I would favor a dedicated repo, to avoid giving the impression that it
> is somehow tied to the Parquet file format.
>
> Regards
>
> Antoine.
>
>
> On Mon, 26 Aug 2024 09:39:49 -0700
> Ryan Blue <b...@databricks.com.INVALID>
> wrote:
> > I think it makes sense to either put it in parquet-format or its own repo.
> > I think the main thing is that we want this to be self-contained so that it
> > can be used broadly.
> >
> > On Mon, Aug 26, 2024 at 12:56 AM Fokko Driesprong <fokko-1odqgaof3lkdnm+yrof...@public.gmane.org> wrote:
> >
> > > I suggested a separate repo in another thread, but I prefer to merge it
> > > into parquet-format, for the reasons that Gábor already pointed out.
> > >
> > >
> > > > It seems reasonable to put the java implementation in the parquet-java
> > >
> > >
> > > I also agree with that, it should be just a module in the Maven project.
> > >
> > > Kind regards,
> > > Fokko
> > >
> > > On Mon, 26 Aug 2024 at 09:06, Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > >
> > > > I thought a separate repo is considered for hosting variant
> > > > implementations for different languages. For the variant spec,
> > > > it makes sense to be moved to the parquet-format repository.
> > > > Considering the fact that parquet implementations are scattered
> > > > in different repos (parquet-java, arrow-cpp, arrow-rs, etc.), it seems
> > > > reasonable to put the java implementation in the parquet-java, if
> > > > we can manage the release cycle to meet the expectation of
> > > > downstream projects.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Mon, Aug 26, 2024 at 2:59 PM Gábor Szádovszky <ga...@apache.org> wrote:
> > > >
> > > > > > Sorry, I've created another head for the thread. Let me put it back here.
> > > > > >
> > > > > > I think Parquet-format is a good place for the spec of Variant.
> > > > > >
> > > > > > Once the spec is in Parquet-format, it does not differ much from any
> > > > > > other Parquet feature. The shredding depends on the related type system.
> > > > > > It is currently specified for Parquet directly. Do we think there will be
> > > > > > significant amounts of code that would be independent from Parquet? If
> > > > > > not, I don't think we'll need a separate repo for the implementations. We
> > > > > > did not do similar things for other Parquet features.
> > > > > > If we think it makes sense we can have a separate module in parquet-java
> > > > > > that may only depend on other low level parquet modules (like
> > > > > > parquet-format but surely not hadoop). This way any Java-based project
> > > > > > can easily use it.
> > > > > > What do you think?
> > > > >
> > > > > Gabor
> > > > >
> > > > > > On Mon, 26 Aug 2024 at 8:51, Gang Wu <ust...@gmail.com> wrote:
> > > > >
> > > > > > > A separate repo for variant type makes sense to me. And I don't think
> > > > > > > we need to have two reference implementations ready before the
> > > > > > > adoption because it is already a released spec.
> > > > > > >
> > > > > > > > Is the intent to release it independently of the Parquet-format spec?
> > > > > > > > I see the Variant type also has a version.
> > > > > > >
> > > > > > > IIUC, the version field in the variant spec indicates how variant data is
> > > > > > > encoded. If this is the case, we should bump parquet-format version
> > > > > > > when a new encoding scheme is introduced.
> > > > > >
> > > > > > Best,
> > > > > > Gang
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Sat, Aug 24, 2024 at 8:43 AM Julien Le Dem <jul...@apache.org> wrote:
> > > > > >
> > > > > > > > (Note: I am also catching up on the threads linked in the email)
> > > > > > > >
> > > > > > > > On Fri, Aug 23, 2024 at 5:38 PM Julien Le Dem <jul...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > I am in favor of making this a separate artifact that other projects can
> > > > > > > > > depend on without pulling extra dependencies they might not want.
> > > > > > > > > What do others think about a separate repo?
> > > > > > > > > Is the intent to release it independently of the Parquet-format spec? I
> > > > > > > > > see the Variant type also has a version.
> > > > > > > > Julien
> > > > > > > >
> > > > > > > > > On Fri, Aug 23, 2024 at 4:31 PM Daniel Weeks <dwe...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > >> Julien,
> > > > > > > > >>
> > > > > > > > >> I think there's interest in supporting multiple language implementations
> > > > > > > > >> for variant (java/scala/cpp/rust/etc), so we might want to consider having
> > > > > > > > >> a 'parquet-variant' repository to house the spec and language
> > > > > > > > >> implementations.  That might also help to keep them aligned, but open to
> > > > > > > > >> other suggestions.
> > > > > > > >>
> > > > > > > >> -Dan
> > > > > > > >>
> > > > > > > > >> On Fri, Aug 23, 2024 at 3:07 PM Julien Le Dem <jul...@apache.org> wrote:
> > > > > > > > >>
> > > > > > > > >> > Hello,
> > > > > > > > >> > I think it is great that we are converging on a Variant type.
> > > > > > > > >> > For the parquet-java implementation, it looks like it could be as easy as
> > > > > > > > >> > importing the spark implementation [1]?
> > > > > > > > >> > I'm not sure this is actually blocking anything as I'm assuming this gets
> > > > > > > > >> > stored in a binary type today.
> > > > > > > > >> > Is there an existing Cpp implementation?
> > > > > > > > >> > Are there other existing types defined somewhere else solving that same
> > > > > > > > >> > need that we should be paying attention to? (or should become compatible
> > > > > > > > >> > with this)
> > > > > > > > >> > Best
> > > > > > > > >> > Julien
> > > > > > > > >> > [1]
> > > > > > > > >> > https://github.com/apache/spark/tree/master/common/variant/src/main/java/org/apache/spark/types/variant
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > > >> > On Fri, Aug 23, 2024 at 2:17 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > > > >> >
> > > > > > > > >> > > > Do we have volunteers to implement it in Parquet-java + another OSS
> > > > > > > > >> > > > implementation?
> > > > > > > > >> > >
> > > > > > > > >> > > I don't think that should be a blocker for incorporating. I'd be inclined
> > > > > > > > >> > > to do something like mark it as experimental or similar in the spec until
> > > > > > > > >> > > the reference impls are done.
> > > > > > > >> > >
> > > > > > > > >> > > On Fri, Aug 23, 2024 at 10:32 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > I'm in favor of this, but wondering on the logistics.  Do we have
> > > > > > > > >> > > > volunteers to implement it in Parquet-java + another OSS implementation or
> > > > > > > > >> > > > are we going to bypass this requirement for now?
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks,
> > > > > > > >> > > > Micah
> > > > > > > >> > > >
> > > > > > > > >> > > > On Friday, August 23, 2024, Ryan Blue <b...@databricks.com.invalid> wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > +1
> > > > > > > >> > > > >
> > > > > > > > >> > > > > On Fri, Aug 23, 2024 at 12:30 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > > > +1
> > > > > > > >> > > > > >
> > > > > > > > >> > > > > > On Fri, Aug 23, 2024 at 8:51 AM Nong Li <non...@gmail.com> wrote:
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > > +1.
> > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > On Fri, Aug 23, 2024 at 12:57 PM Jan Finis <jpfi...@gmail.com> wrote:
> > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > I would also appreciate having native Variant support in Parquet.
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > On Fri, 23 Aug 2024 at 12:10, Fokko Driesprong <fo...@apache.org> wrote:
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > > Hey Gang,
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Thanks for raising this. +1 from my end.
> > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > For context, as Gang mentioned, when proposing to add a Variant Type to
> > > > > > > > >> > > > > > > > > Iceberg <https://github.com/apache/iceberg/issues/10392>, one of the
> > > > > > > > >> > > > > > > > > future goals was to integrate more closely with Parquet, and having the
> > > > > > > > >> > > > > > > > > spec at Parquet will help to speed this up.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Kind regards,
> > > > > > > >> > > > > > > > > Fokko
> > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > On Fri, 23 Aug 2024 at 11:37, Gábor Szádovszky <ga...@apache.org> wrote:
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > > Hi Gang,
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Thanks for bringing this up.
> > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > I think that if the Variant type had come up earlier (before
> > > > > > > > >> > > > > > > > > > iceberg/arrow), its natural place would have been at the file format
> > > > > > > > >> > > > > > > > > > level like any other type. The communities started discussing where it
> > > > > > > > >> > > > > > > > > > should be placed because now we have different type systems at
> > > > > > > > >> > > > > > > > > > different places.
> > > > > > > > >> > > > > > > > > > Also, the current spec of Variant makes it more or less independent
> > > > > > > > >> > > > > > > > > > from the Parquet file format.
> > > > > > > > >> > > > > > > > > > However, even at Parquet level, we would need at least an additional
> > > > > > > > >> > > > > > > > > > Logical type to help handle Variant type by the systems reading/writing
> > > > > > > > >> > > > > > > > > > Parquet.
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > To summarize my opinion, +1 for having the whole Variant spec in
> > > > > > > > >> > > > > > > > > > Parquet format.
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > >> > > > > > > > > > Gabor
> > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > On Fri, 23 Aug 2024 at 11:18, Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Hi,
> > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > Apache Iceberg is adding variant type support [1][2] by adopting the
> > > > > > > > >> > > > > > > > > > > variant spec [3] from Apache Spark. As the proposal is getting mature,
> > > > > > > > >> > > > > > > > > > > both Iceberg [4] and Spark [5] communities are discussing moving the
> > > > > > > > >> > > > > > > > > > > variant type to Parquet repo to avoid divergence. Moving it into Parquet
> > > > > > > > >> > > > > > > > > > > makes the variant spec engine and table format agnostic, which may
> > > > > > > > >> > > > > > > > > > > encourage wider adoption.
> > > > > > > > >> > > > > > > > > > >
> > > > > > > > >> > > > > > > > > > > What do people from Parquet community think?
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > [1]
> > > > > > > >> > > > > > >
> > > > > > > >>
> > > https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > > > > > > >> > > > > > > > > > > [2]
> > > > > > > >> > > > > > >
> > > > > > > >>
> > > https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq
> > > > > > > >> > > > > > > > > > > [3]
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> >
> > > > >
> https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179
> > > > > > > >> > > > > 089ebd71ad/common/variant/README.md
> > > > > > > >> > > > > > > > > > > [4]
> > > > > > > >> > > > > > >
> > > > > > > >>
> > > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
> > > > > > > >> > > > > > > > > > > [5]
> > > > > > > >> > > > > > >
> > > > > > > >>
> > > https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > > > Best,
> > > > > > > >> > > > > > > > > > > Gang
> > > > > > > >> > > > > > > > > > >
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > > --
> > > > > > > >> > > > > Ryan Blue
> > > > > > > >> > > > > Databricks
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
>
>
>
>
