It's not just Parquet Variant; it's also Iceberg (which has standardized on this) and Spark's in-memory representation (where this encoding scheme originated). An important aspect of this proposal is that it supports easy and (mostly) zero-copy integration with Spark's in-memory representation of the Variant type.
> If we were to design a Variant type specifically for Arrow, it would
> probably look a bit different (in particular, we would make a better use of
> validity bitmaps, and we would also rely on union types; besides, the
> supported primitive types would reflect the Arrow type system).

When coming up with this proposal I tried to think of ways I could better
leverage the validity bitmaps, and ultimately they are still being used. The
difficulty comes from having to represent the difference between a field
being *missing* and a field being *present* but null. If we could come up
with a better way to represent that, I'd be fine with using it in this
proposal, but I couldn't think of a better method than the one the existing
encoding uses. (There's a small sketch at the bottom of this message
illustrating the ambiguity.)

As far as relying on union types, the reason we can't do so is that the
specific purpose of this Variant type is that we don't know the types up
front; it's dynamic. Arrow union types also do not allow arbitrarily
recursive structures, which is one of the issues others have run into with
them.

Another consideration is the goal of projecting the shredded columns. Since
we need to project shredded columns, we'd only be able to use sparse unions,
not dense ones, and while it is well defined how to project the child of a
struct, it's less defined how to project the child of a union.

I'm still open to other possibilities for how we represent it, but following
the existing representation in this way seemed the best route for
interoperability.

--Matt

On Mon, May 12, 2025 at 2:47 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi Matt,
>
> Thanks for putting this together.
>
> I think we should make clear that this extension type is for
> transporting Parquet Variants. If we were to design a Variant type
> specifically for Arrow, it would probably look a bit different (in
> particular, we would make a better use of validity bitmaps, and we would
> also rely on union types; besides, the supported primitive types would
> reflect the Arrow type system).
>
> Regards
>
> Antoine.
>
>
> Le 09/05/2025 à 00:00, Matt Topol a écrit :
> > Hey All,
> >
> > There have been various discussions occurring in many different thread
> > locations (issues, PRs, and so on)[1][2][3], and more that I haven't
> > linked to, concerning what a canonical Variant Extension Type for
> > Arrow might look like. As I've looked into implementing some things,
> > I've also spoken with members of the Arrow, Iceberg and Parquet
> > communities about what a good representation for an Arrow Variant
> > would be in order to ensure good support and adoption.
> >
> > I also looked at the ClickHouse Variant implementation [4]. The
> > ClickHouse Variant is nearly equivalent to the Arrow dense union type,
> > so we don't need to do any extra work there to support it.
> >
> > So, after discussions and looking into the needs of engines and so
> > on, I've iterated and written up a proposal for what a canonical
> > Variant Extension Type for Arrow could be in a Google doc[5]. I'm
> > hoping that this can spark some discussion and comments on the
> > document. If there's relative consensus on it, then I'll work on
> > creating some implementations of it that I can use to formally propose
> > the addition to the Canonical Extensions.
> >
> > Please take a read and leave comments on the Google doc or on this
> > thread. Thanks everyone!
> >
> > --Matt
> >
> > [1]: https://github.com/apache/arrow-rs/issues/7063
> > [2]: https://github.com/apache/arrow/issues/45937
> > [3]: https://github.com/apache/arrow/pull/45375#issuecomment-2649807352
> > [4]: https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse
> > [5]: https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?usp=sharing
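
P.S. Here's the sketch I mentioned above regarding missing vs. present-but-null.
This is just plain pyarrow for illustration, not the layout from the proposal
doc; it only shows why a struct's validity bitmaps alone can't carry the
distinction:

    import pyarrow as pa

    # Two values we'd like to round-trip through a shredded struct<a: int64>:
    #   row 0: {"a": null}  -- "a" is present but null
    #   row 1: {}           -- "a" is missing entirely
    child_a = pa.array([None, None], type=pa.int64())
    shredded = pa.StructArray.from_arrays([child_a], names=["a"])

    # Both rows come out identical: the child's validity bitmap can only say
    # "null", not *why* it is null, so the missing/null distinction is lost.
    print(shredded.to_pylist())  # [{'a': None}, {'a': None}]

The existing Variant encoding keeps that distinction in the binary value
itself, which is why the proposal follows it rather than the bitmaps alone.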