Given the momentum I see in my day-to-day work around the Parquet Variant type
(Spark, Snowflake, parquet-java, Iceberg, soon parquet-rs, etc.), I am
heavily in favor of making it a canonical extension type in Arrow that is
zero-copy compatible with the Parquet Variant.

Thank you for bringing up this important topic.

Andrew

On Mon, May 12, 2025 at 2:21 PM Dewey Dunnington <dewey.dunning...@gmail.com>
wrote:

> > I think we should make clear that this extension type is for
> > transporting Parquet Variants. If we were to design a Variant type
> > specifically for Arrow, it would probably look a bit different
>
> That's a great point...there are definitely advantages to both: keeping the
> spec identical to Parquet means it's easier to implement because it's
> "just" applying a label to the default decode we get from Arrowish Parquet
> readers and unlocks reading and writing without inventing new things.
> Inventing an Arrow-native version that uses more Arrow concepts might be
> faster for purely in-memory transformations, but if all we're ever doing
> with it is encoding and decoding it from/into the Parquet version, is it
> worth it? (Or can we defer the design of that type to a different time
> based on the experience of the Parquet version?)
>
> On Mon, May 12, 2025 at 12:17 PM Felipe Oliveira Carvalho <
> felipe...@gmail.com> wrote:
>
> > > As far as relying on union types, the reason we can't do so is because
> > > the specific purpose of this Variant type is that we don't know the
> > > types up front, it's dynamic.
> >
> > This is why "VARIANT" is a misnomer for this type. It's a DYNAMIC type,
> > not a VARIANT (a type that can be a sum of multiple types that are known).
> >
> > But the misuse of VARIANT is already widespread, so we now end up with
> > UNIONS and VARIANTS in the same type system.
> >
> > --
> > Felipe
> >
> > On Mon, May 12, 2025 at 1:21 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > > It's not just Parquet Variant, it's also Iceberg (which has
> > > standardized on this) and Spark in-memory (where this encoding scheme
> > > > originated). An important aspect of this proposal is that it
> > > > supports easy and (mostly) zero-copy integration with Spark's
> > > > in-memory representation of the Variant type.
> > >
> > > > > If we were to design a Variant type specifically for Arrow, it would
> > > > > probably look a bit different (in particular, we would make a better
> > > > > use of validity bitmaps, and we would also rely on union types;
> > > > > besides, the supported primitive types would reflect the Arrow type
> > > > > system).
> > >
> > > When coming up with this proposal I tried to think of ways I could
> > > better leverage the validity bitmaps, and ultimately they are still
> > > being used. The difficulty comes with having to represent the
> > > difference between a field *missing* and a field being *present* but
> > > null. If we could come up with a better way to represent that, I'd be
> > > fine with using that in this proposal, but I couldn't think of a
> > > better method than what they were using.
> > >
> > > As far as relying on union types, the reason we can't do so is because
> > > the specific purpose of this Variant type is that we don't know the
> > > > types up front, it's dynamic. Also, Arrow union types do not allow for
> > > > arbitrarily recursive structures, which is one of the issues that
> > > > others have run into with the Arrow union types. Another consideration
> > > > is the goal of projecting the shredded columns. Since we need to
> > > > project shredded columns, we'd only be able to use Sparse Unions, not
> > > > dense ones, and while it is well defined how to project the child of a
> > > > struct, it's less defined how to project the child of a union.
> > >
> > > I'm still open to other possibilities in how we represent it, but
> > > following the existing representation in this way seemed the best
> > > route for interoperability.
> > >
> > > --Matt
> > >
> > > > On Mon, May 12, 2025 at 2:47 AM Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > >
> > > > Hi Matt,
> > > >
> > > > Thanks for putting this together.
> > > >
> > > > I think we should make clear that this extension type is for
> > > > transporting Parquet Variants. If we were to design a Variant type
> > > > specifically for Arrow, it would probably look a bit different (in
> > > > > particular, we would make a better use of validity bitmaps, and we
> > > > > would also rely on union types; besides, the supported primitive types
> > > > > would reflect the Arrow type system).
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > > On 9 May 2025 at 00:00, Matt Topol wrote:
> > > > > Hey All,
> > > > >
> > > > > > There have been various discussions in many different places
> > > > > > (issues, PRs, and so on)[1][2][3], and more that I haven't
> > > > > > linked to, concerning what a canonical Variant Extension Type for
> > > > > > Arrow might look like. As I've looked into implementing some
> > > > > > things, I've also spoken with members of the Arrow, Iceberg and
> > > > > > Parquet communities as to what a good representation for Arrow
> > > > > > Variant would look like in order to ensure good support and
> > > > > > adoption.
> > > > >
> > > > > I also looked at the ClickHouse variant implementation [4]. The
> > > > > > ClickHouse Variant is nearly equivalent to the Arrow Dense Union
> > > > > > type, so we don't need to do any extra work there to support it.
> > > > >
> > > > > So, after discussions and looking into the needs for engines and so
> > > > > on, I've iterated and written up a proposal for what a Canonical
> > > > > Variant Extension Type for Arrow could be in a google doc[5]. I'm
> > > > > hoping that this can spark some discussion and comments on the
> > > > > document. If there's relative consensus on it, then I'll work on
> > > > > > creating some implementations of it that I can use to formally
> > > > > > propose the addition to the Canonical Extensions.
> > > > >
> > > > > Please take a read and leave comments on the google doc or on this
> > > > > thread. Thanks everyone!
> > > > >
> > > > > --Matt
> > > > >
> > > > > [1]: https://github.com/apache/arrow-rs/issues/7063
> > > > > [2]: https://github.com/apache/arrow/issues/45937
> > > > > [3]:
> > > https://github.com/apache/arrow/pull/45375#issuecomment-2649807352
> > > > > [4]:
> > >
> https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse
> > > > > [5]:
> > >
> >
> https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?usp=sharing
> > > >
> > >
> >
>
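
For readers following the thread, the distinction Matt raises between a field
that is *missing* and a field that is *present* but null can be sketched
roughly as below. This is only an illustration in plain Python, not code from
the proposal or from any Arrow library. It assumes a shredded object field is
carried as a `value` slot and a `typed_value` slot, that `None` stands in for
an Arrow slot whose validity bit is unset, and that a Variant null is encoded
as the single header byte 0x00. The names `VARIANT_NULL` and `field_state`
are hypothetical.

```python
from typing import Optional

# Assumption: a Variant null is a primitive value whose one-byte header is 0x00.
VARIANT_NULL = b"\x00"


def field_state(value: Optional[bytes], typed_value: Optional[int]) -> str:
    """Classify one slot of a (hypothetical) shredded object field.

    `None` models an Arrow null, i.e. a slot whose validity bit is unset.
    """
    if typed_value is not None:
        # The field was shredded into a strongly typed column.
        return "present (shredded)"
    if value is None:
        # Both slots are null at the Arrow level: the field does not exist.
        return "missing"
    if value == VARIANT_NULL:
        # The slot is valid, but the encoded Variant value itself is null.
        return "present but null"
    # Any other encoded bytes: a present value that was not shredded.
    return "present (unshredded)"


if __name__ == "__main__":
    for slot in [(None, None), (VARIANT_NULL, None), (None, 42)]:
        print(slot, "->", field_state(*slot))
```

The point of the sketch is that the validity bitmap alone has only two states,
so a third state (present but null) has to be carried inside the encoded
value, which is why the bitmaps are still used but cannot do the whole job.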
