> As far as relying on union types, the reason we can't do so is because
> the specific purpose of this Variant type is that we don't know the
> types up front, it's dynamic.

This is why "VARIANT" is a misnomer for this type. It's a DYNAMIC type,
not a VARIANT (a type that can be a sum of multiple types that are
known). But the misuse of VARIANT is already widespread, so we now end
up with UNIONS and VARIANTS in the same type system.

--
Felipe

On Mon, May 12, 2025 at 1:21 PM Matt Topol <zotthewiz...@gmail.com> wrote:

> It's not just Parquet Variant, it's also Iceberg (which has
> standardized on this) and Spark in-memory (where this encoding scheme
> originated). It's actually an important aspect to this that it
> supports easy and (mostly) zero-copy integration with Spark's
> in-memory representation of the Variant type.
>
> > If we were to design a Variant type specifically for Arrow, it would
> > probably look a bit different (in particular, we would make a better use of
> > validity bitmaps, and we would also rely on union types; besides, the
> > supported primitive types would reflect the Arrow type system).
>
> When coming up with this proposal I tried to think of ways I could
> better leverage the validity bitmaps, and ultimately they are still
> being used. The difficulty comes with having to represent the
> difference between a field *missing* and a field being *present* but
> null. If we could come up with a better way to represent that, I'd be
> fine with using that in this proposal, but I couldn't think of a
> better method than what they were using.
>
> As far as relying on union types, the reason we can't do so is because
> the specific purpose of this Variant type is that we don't know the
> types up front, it's dynamic. Also Arrow union types do not allow for
> arbitrarily recursive structures, it's one of the issues that others
> have run into with the Arrow union types. Another consideration is the
> goal of projecting the shredded columns.
> Since we need to project shredded columns, we'd only be able to use
> Sparse Unions, not dense ones, and while it is well defined how to
> project the child of a struct it's less defined how to project the
> child of a union.
>
> I'm still open to other possibilities in how we represent it, but
> following the existing representation in this way seemed the best
> route for interoperability.
>
> --Matt
>
> On Mon, May 12, 2025 at 2:47 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > Hi Matt,
> >
> > Thanks for putting this together.
> >
> > I think we should make clear that this extension type is for
> > transporting Parquet Variants. If we were to design a Variant type
> > specifically for Arrow, it would probably look a bit different (in
> > particular, we would make a better use of validity bitmaps, and we would
> > also rely on union types; besides, the supported primitive types would
> > reflect the Arrow type system).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 09/05/2025 at 00:00, Matt Topol wrote:
> > > Hey All,
> > >
> > > There's been various discussions occurring on many different thread
> > > locations (issues, PRs, and so on)[1][2][3], and more that I haven't
> > > linked to, concerning what a canonical Variant Extension Type for
> > > Arrow might look like. As I've looked into implementing some things,
> > > I've also spoken with members of the Arrow, Iceberg and Parquet
> > > communities as to what a good representation for Arrow Variant would
> > > be like in order to ensure good support and adoption.
> > >
> > > I also looked at the ClickHouse variant implementation [4]. The
> > > ClickHouse Variant is nearly equivalent to the Arrow Dense Union type,
> > > so we don't need to do any extra work there to support it.
> > >
> > > So, after discussions and looking into the needs for engines and so
> > > on, I've iterated and written up a proposal for what a Canonical
> > > Variant Extension Type for Arrow could be in a google doc[5].
> > > I'm hoping that this can spark some discussion and comments on the
> > > document. If there's relative consensus on it, then I'll work on
> > > creating some implementations of it that I can use to formally propose
> > > the addition to the Canonical Extensions.
> > >
> > > Please take a read and leave comments on the google doc or on this
> > > thread. Thanks everyone!
> > >
> > > --Matt
> > >
> > > [1]: https://github.com/apache/arrow-rs/issues/7063
> > > [2]: https://github.com/apache/arrow/issues/45937
> > > [3]: https://github.com/apache/arrow/pull/45375#issuecomment-2649807352
> > > [4]: https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse
> > > [5]: https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?usp=sharing