@Andrew: The current proposal in the document is essentially zero-copy compatible with the Parquet Variant (give or take the differences in how variable-length binary values are represented in Parquet vs. Arrow; the proposal allows any of Binary/LargeBinary/BinaryView to represent the values).
@everyone: So far it seems we've reached most of a consensus on the doc, so if no one has any further significant changes to request or suggest, I'm going to start creating an implementation of the Variant extension type, since we need implementations in order to hold a vote for it to become a canonical extension type.

--Matt

On Tue, May 13, 2025 at 6:47 AM Andrew Lamb <al...@influxdata.com> wrote:
>
> Given the momentum I see in my day to day around the Parquet Variant type
> (Spark, Snowflake, parquet-java, Iceberg, soon to be parquet-rs, etc.) I am
> heavily in favor of making it a canonical extension type in Arrow that is
> zero-copy compatible with the Parquet Variant.
>
> Thank you for bringing up this important topic
>
> Andrew
>
> On Mon, May 12, 2025 at 2:21 PM Dewey Dunnington <dewey.dunning...@gmail.com> wrote:
>
> > > I think we should make clear that this extension type is for
> > > transporting Parquet Variants. If we were to design a Variant type
> > > specifically for Arrow, it would probably look a bit different
> >
> > That's a great point...there are definitely advantages to both: keeping the
> > spec identical to Parquet means it's easier to implement because it's
> > "just" applying a label to the default decode we get from Arrow-ish Parquet
> > readers, and it unlocks reading and writing without inventing new things.
> > Inventing an Arrow-native version that uses more Arrow concepts might be
> > faster for purely in-memory transformations, but if all we're ever doing
> > with it is encoding it into and decoding it from the Parquet version, is it
> > worth it? (Or can we defer the design of that type to a different time,
> > based on experience with the Parquet version?)
> > On Mon, May 12, 2025 at 12:17 PM Felipe Oliveira Carvalho <felipe...@gmail.com> wrote:
> >
> > > > As far as relying on union types, the reason we can't do so is because
> > > > the specific purpose of this Variant type is that we don't know the
> > > > types up front, it's dynamic.
> > >
> > > This is why "VARIANT" is a misnomer for this type. It's a DYNAMIC type, not
> > > a VARIANT (a type that can be a sum of multiple types that are known).
> > >
> > > But the misuse of VARIANT is already widespread, so we now end up with
> > > UNIONS and VARIANTS in the same type system.
> > >
> > > --
> > > Felipe
> > >
> > > On Mon, May 12, 2025 at 1:21 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> > >
> > > > It's not just the Parquet Variant: it's also Iceberg (which has
> > > > standardized on this) and Spark in-memory (where this encoding scheme
> > > > originated). It's actually an important aspect of this proposal that it
> > > > supports easy and (mostly) zero-copy integration with Spark's
> > > > in-memory representation of the Variant type.
> > > >
> > > > > If we were to design a Variant type specifically for Arrow, it would
> > > > > probably look a bit different (in particular, we would make better use of
> > > > > validity bitmaps, and we would also rely on union types; besides, the
> > > > > supported primitive types would reflect the Arrow type system).
> > > >
> > > > When coming up with this proposal I tried to think of ways I could
> > > > better leverage the validity bitmaps, and ultimately they are still
> > > > being used. The difficulty comes with having to represent the
> > > > difference between a field *missing* and a field being *present* but
> > > > null. If we could come up with a better way to represent that, I'd be
> > > > fine with using it in this proposal, but I couldn't think of a
> > > > better method than the one they were using.
> > > > As far as relying on union types, the reason we can't do so is that
> > > > the specific purpose of this Variant type is that we don't know the
> > > > types up front; it's dynamic. Also, Arrow union types do not allow for
> > > > arbitrarily recursive structures, which is one of the issues that others
> > > > have run into with them. Another consideration is the goal of projecting
> > > > the shredded columns: since we need to project shredded columns, we'd
> > > > only be able to use sparse unions, not dense ones, and while it is well
> > > > defined how to project the child of a struct, it's less well defined how
> > > > to project the child of a union.
> > > >
> > > > I'm still open to other possibilities for how we represent it, but
> > > > following the existing representation in this way seemed the best
> > > > route for interoperability.
> > > >
> > > > --Matt
> > > >
> > > > On Mon, May 12, 2025 at 2:47 AM Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > > > Hi Matt,
> > > > >
> > > > > Thanks for putting this together.
> > > > >
> > > > > I think we should make clear that this extension type is for
> > > > > transporting Parquet Variants. If we were to design a Variant type
> > > > > specifically for Arrow, it would probably look a bit different (in
> > > > > particular, we would make better use of validity bitmaps, and we would
> > > > > also rely on union types; besides, the supported primitive types would
> > > > > reflect the Arrow type system).
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > > On 09/05/2025 at 00:00, Matt Topol wrote:
> > > > > > Hey All,
> > > > > >
> > > > > > There have been various discussions occurring in many different
> > > > > > places (issues, PRs, and so on) [1][2][3], and more that I haven't
> > > > > > linked to, concerning what a canonical Variant extension type for
> > > > > > Arrow might look like.
> > > > > > As I've looked into implementing some things,
> > > > > > I've also spoken with members of the Arrow, Iceberg and Parquet
> > > > > > communities about what a good representation for an Arrow Variant
> > > > > > would look like, in order to ensure good support and adoption.
> > > > > >
> > > > > > I also looked at the ClickHouse Variant implementation [4]. The
> > > > > > ClickHouse Variant is nearly equivalent to the Arrow dense union
> > > > > > type, so we don't need to do any extra work there to support it.
> > > > > >
> > > > > > So, after discussions and looking into the needs of engines and so
> > > > > > on, I've iterated and written up a proposal for what a canonical
> > > > > > Variant extension type for Arrow could be, in a Google doc [5]. I'm
> > > > > > hoping that this can spark some discussion and comments on the
> > > > > > document. If there's relative consensus on it, then I'll work on
> > > > > > creating some implementations of it that I can use to formally
> > > > > > propose the addition to the canonical extensions.
> > > > > >
> > > > > > Please take a read and leave comments on the Google doc or on this
> > > > > > thread. Thanks everyone!
> > > > > >
> > > > > > --Matt
> > > > > >
> > > > > > [1]: https://github.com/apache/arrow-rs/issues/7063
> > > > > > [2]: https://github.com/apache/arrow/issues/45937
> > > > > > [3]: https://github.com/apache/arrow/pull/45375#issuecomment-2649807352
> > > > > > [4]: https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse
> > > > > > [5]: https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?usp=sharing