While going through the Parquet Variant spec, I found that an Arrow extension type might be a must, because decoding Parquet variant values row by row is very inefficient.
I've drafted a decoding tool in Parquet C++, and it is ready for review now [1].

[1] https://github.com/apache/arrow/pull/46372

Best,
Xuwei Fu

On Fri, May 9, 2025 at 06:03, Matt Topol <zotthewiz...@gmail.com> wrote:

> Hey All,
>
> There's been various discussions occurring on many different thread
> locations (issues, PRs, and so on)[1][2][3], and more that I haven't
> linked to, concerning what a canonical Variant Extension Type for
> Arrow might look like. As I've looked into implementing some things,
> I've also spoken with members of the Arrow, Iceberg and Parquet
> communities as to what a good representation for Arrow Variant would
> be like in order to ensure good support and adoption.
>
> I also looked at the ClickHouse variant implementation [4]. The
> ClickHouse Variant is nearly equivalent to the Arrow Dense Union type,
> so we don't need to do any extra work there to support it.
>
> So, after discussions and looking into the needs for engines and so
> on, I've iterated and written up a proposal for what a Canonical
> Variant Extension Type for Arrow could be in a google doc[5]. I'm
> hoping that this can spark some discussion and comments on the
> document. If there's relative consensus on it, then I'll work on
> creating some implementations of it that I can use to formally propose
> the addition to the Canonical Extensions.
>
> Please take a read and leave comments on the google doc or on this
> thread. Thanks everyone!
>
> --Matt
>
> [1]: https://github.com/apache/arrow-rs/issues/7063
> [2]: https://github.com/apache/arrow/issues/45937
> [3]: https://github.com/apache/arrow/pull/45375#issuecomment-2649807352
> [4]: https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse
> [5]: https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?usp=sharing