I think Parquet-format is a good place for the Variant spec. Once the spec is in Parquet-format, Variant is not much different from any other Parquet feature. The shredding depends on the related type system, and it is currently specified for Parquet directly. Do we think there will be a significant amount of code that is independent of Parquet? If not, I don't think we need a separate repo for the implementations; we did not do that for other Parquet features. If we think it makes sense, we can have a separate module in parquet-java that depends only on other low-level Parquet modules (like parquet-format, but certainly not Hadoop). That way any Java-based project can easily use it. What do you think?
Gabor

On Sat, Aug 24, 2024 at 2:40, Julien Le Dem <jul...@apache.org> wrote:

> I am in favor of making this a separate artifact that other projects can depend on without pulling extra dependencies they might not want.
> What do others think about a separate repo?
> Is the intent to release it independently of the Parquet-format spec? I see the Variant type also has a version.
> Julien
>
> On Fri, Aug 23, 2024 at 4:31 PM Daniel Weeks <dwe...@apache.org> wrote:
>
> > Julien,
> >
> > I think there's interest in supporting multiple language implementations for variant (java/scala/cpp/rust/etc), so we might want to consider having a 'parquet-variant' repository to house the spec and language implementations. That might also help to keep them aligned, but open to other suggestions.
> >
> > -Dan
> >
> > On Fri, Aug 23, 2024 at 3:07 PM Julien Le Dem <jul...@apache.org> wrote:
> >
> > > Hello,
> > > I think it is great that we are converging on a Variant type.
> > > For the parquet-java implementation, it looks like it could be as easy as importing the spark implementation [1]?
> > > I'm not sure this is actually blocking anything as I'm assuming this gets stored in a binary type today.
> > > Is there an existing Cpp implementation?
> > > Are there other existing types defined somewhere else solving that same need that we should be paying attention to? (or should become compatible with this)
> > > Best
> > > Julien
> > > [1] https://github.com/apache/spark/tree/master/common/variant/src/main/java/org/apache/spark/types/variant
> > >
> > > On Fri, Aug 23, 2024 at 2:17 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > >
> > > > > Do we have volunteers to implement it in Parquet-java + another OSS implementation?
> > > >
> > > > I don't think that should be a blocker for incorporating. I'd be inclined to do something like mark it as experimental or similar in the spec until the reference impls are done.
> > > >
> > > > On Fri, Aug 23, 2024 at 10:32 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >
> > > > > I'm in favor of this, but wondering on the logistics. Do we have volunteers to implement it in Parquet-java + another OSS implementation or are we going to bypass this requirement for now?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Friday, August 23, 2024, Ryan Blue <b...@databricks.com.invalid> wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > On Fri, Aug 23, 2024 at 12:30 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > On Fri, Aug 23, 2024 at 8:51 AM Nong Li <non...@gmail.com> wrote:
> > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > On Fri, Aug 23, 2024 at 12:57 PM Jan Finis <jpfi...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I would also appreciate having native Variant support in Parquet.
> > > > > > > > >
> > > > > > > > > On Fri, Aug 23, 2024 at 12:10, Fokko Driesprong <fo...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Hey Gang,
> > > > > > > > > >
> > > > > > > > > > Thanks for raising this. +1 from my end.
> > > > > > > > > >
> > > > > > > > > > For context, as Gang mentioned, when proposing to add a Variant Type to Iceberg <https://github.com/apache/iceberg/issues/10392>, one of the future goals was to integrate more closely with Parquet, and having the spec at Parquet will help to speed this up.
> > > > > > > > > >
> > > > > > > > > > Kind regards,
> > > > > > > > > > Fokko
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 23, 2024 at 11:37, Gábor Szádovszky <ga...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Gang,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for bringing this up.
> > > > > > > > > > >
> > > > > > > > > > > I think that if Variant type would have come up earlier (before iceberg/arrow), its natural place would have been at the file format level as any other types. The communities started discussing where it should be placed because now we have different type systems at different places. Also, the current spec of Variant makes it more or less independent from the Parquet file format.
> > > > > > > > > > > However, even at Parquet level, we would need at least an additional Logical type to help handle Variant type by the systems reading/writing Parquet.
> > > > > > > > > > >
> > > > > > > > > > > To summarize my opinion, +1 for having the whole Variant spec in Parquet format.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Gabor
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Aug 23, 2024 at 11:18, Gang Wu <ust...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > Apache Iceberg is adding variant type support [1][2] by adopting the variant spec [3] from Apache Spark. As the proposal is getting mature, both Iceberg [4] and Spark [5] communities are discussing moving the variant type to Parquet repo to avoid divergence. Moving it into Parquet makes the variant spec engine and table format agnostic, which may encourage wider adoption.
> > > > > > > > > > > >
> > > > > > > > > > > > What do people from Parquet community think?
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > > > > > > > > > > > [2] https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq
> > > > > > > > > > > > [3] https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md
> > > > > > > > > > > > [4] https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
> > > > > > > > > > > > [5] https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Gang
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Databricks