Sorry, I accidentally started a separate thread. Let me bring the discussion back here. I think parquet-format is a good place for the Variant spec.
Once the spec is in parquet-format, Variant is not much different from any other Parquet feature. The shredding depends on the related type system, and it is currently specified for Parquet directly. Do we think there will be a significant amount of code that is independent of Parquet? If not, I don't think we need a separate repo for the implementations. We did not do that for other Parquet features.

If we think it makes sense, we can have a separate module in parquet-java that depends only on other low-level Parquet modules (like parquet-format, but certainly not Hadoop). This way any Java-based project can easily use it.

What do you think?

Gabor

On Mon, Aug 26, 2024 at 8:51, Gang Wu <ust...@gmail.com> wrote:

> A separate repo for variant type makes sense to me. And I don't think we need to have two reference implementations ready before the adoption because it is already a released spec.
>
> > Is the intent to release it independently of the Parquet-format spec?
> > I see the Variant type also has a version.
>
> IIUC, the version field in the variant spec advises how variant data is encoded. If this is the case, we should bump parquet-format version when a new encoding scheme is introduced.
>
> Best,
> Gang
>
> On Sat, Aug 24, 2024 at 8:43 AM Julien Le Dem <jul...@apache.org> wrote:
>
> > (Note: I am also catching up on the threads linked in the email)
> >
> > On Fri, Aug 23, 2024 at 5:38 PM Julien Le Dem <jul...@apache.org> wrote:
> >
> > > I am in favor of making this a separate artifact that other projects can depend on without pulling extra dependencies they might not want.
> > > What do others think about a separate repo?
> > > Is the intent to release it independently of the Parquet-format spec? I see the Variant type also has a version.
> > > Julien
> > >
> > > On Fri, Aug 23, 2024 at 4:31 PM Daniel Weeks <dwe...@apache.org> wrote:
> > >
> > > > Julien,
> > > >
> > > > I think there's interest in supporting multiple language implementations for variant (java/scala/cpp/rust/etc), so we might want to consider having a 'parquet-variant' repository to house the spec and language implementations. That might also help to keep them aligned, but open to other suggestions.
> > > >
> > > > -Dan
> > > >
> > > > On Fri, Aug 23, 2024 at 3:07 PM Julien Le Dem <jul...@apache.org> wrote:
> > > >
> > > > > Hello,
> > > > > I think it is great that we are converging on a Variant type.
> > > > > For the parquet-java implementation, it looks like it could be as easy as importing the spark implementation [1]?
> > > > > I'm not sure this is actually blocking anything as I'm assuming this gets stored in a binary type today.
> > > > > Is there an existing Cpp implementation?
> > > > > Are there other existing types defined somewhere else solving that same need that we should be paying attention to? (or should become compatible with this)
> > > > > Best
> > > > > Julien
> > > > > [1] https://github.com/apache/spark/tree/master/common/variant/src/main/java/org/apache/spark/types/variant
> > > > >
> > > > > On Fri, Aug 23, 2024 at 2:17 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > >
> > > > > > > Do we have volunteers to implement it in Parquet-java + another OSS implementation?
> > > > > >
> > > > > > I don't think that should be a blocker for incorporating. I'd be inclined to do something like mark it as experimental or similar in the spec until the reference impls are done.
> > > > > >
> > > > > > On Fri, Aug 23, 2024 at 10:32 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > > > >
> > > > > > > I'm in favor of this, but wondering on the logistics. Do we have volunteers to implement it in Parquet-java + another OSS implementation or are we going to bypass this requirement for now?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Micah
> > > > > > >
> > > > > > > On Friday, August 23, 2024, Ryan Blue <b...@databricks.com.invalid> wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > On Fri, Aug 23, 2024 at 12:30 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > On Fri, Aug 23, 2024 at 8:51 AM Nong Li <non...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > +1.
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 23, 2024 at 12:57 PM Jan Finis <jpfi...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > I would also appreciate having native Variant support in Parquet.
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Aug 23, 2024 at 12:10, Fokko Driesprong <fo...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey Gang,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for raising this. +1 from my end.
> > > > > > > > > > > >
> > > > > > > > > > > > For context, as Gang mentioned, when proposing to add a Variant Type to Iceberg <https://github.com/apache/iceberg/issues/10392>, one of the future goals was to integrate more closely with Parquet, and having the spec at Parquet will help to speed this up.
> > > > > > > > > > > >
> > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > Fokko
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Aug 23, 2024 at 11:37, Gábor Szádovszky <ga...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Gang,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for bringing this up.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think that if Variant type would have come up earlier (before iceberg/arrow), its natural place would have been at the file format level as any other types. The communities started discussing where it should be placed because now we have different type systems at different places.
> > > > > > > > > > > > > Also, the current spec of Variant makes it more or less independent from the Parquet file format.
> > > > > > > > > > > > > However, even at Parquet level, we would need at least an additional Logical type to help handle Variant type by the systems reading/writing Parquet.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To summarize my opinion, +1 for having the whole Variant spec in Parquet format.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Gabor
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Aug 23, 2024 at 11:18, Gang Wu <ust...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Apache Iceberg is adding variant type support [1][2] by adopting the variant spec [3] from Apache Spark. As the proposal is getting mature, both Iceberg [4] and Spark [5] communities are discussing moving the variant type to Parquet repo to avoid divergence. Moving it into Parquet makes the variant spec engine and table format agnostic, which may encourage wider adoption.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > What do people from Parquet community think?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > > > > > > > > > > > > > [2] https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq
> > > > > > > > > > > > > > [3] https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md
> > > > > > > > > > > > > > [4] https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
> > > > > > > > > > > > > > [5] https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > Gang
> > > > > > > >
> > > > > > > > --
> > > > > > > > Ryan Blue
> > > > > > > > Databricks