When I went through the Parquet Variant spec, I found that an Arrow
extension type may well be necessary, because decoding Parquet row
by row is very inefficient.

I've drafted a decoding tool in Parquet C++, and it is ready for review now [1].

[1] https://github.com/apache/arrow/pull/46372

Best,
Xuwei Fu

Matt Topol <zotthewiz...@gmail.com> wrote on Fri, May 9, 2025 at 06:03:

> Hey All,
>
> There have been various discussions in many different places
> (issues, PRs, and so on)[1][2][3], and more that I haven't
> linked to, concerning what a canonical Variant Extension Type for
> Arrow might look like. As I've looked into implementing some things,
> I've also spoken with members of the Arrow, Iceberg and Parquet
> communities as to what a good representation for Arrow Variant would
> be like in order to ensure good support and adoption.
>
> I also looked at the ClickHouse variant implementation [4]. The
> ClickHouse Variant is nearly equivalent to the Arrow Dense Union type,
> so we don't need to do any extra work there to support it.
>
> So, after discussions and looking into the needs for engines and so
> on, I've iterated and written up a proposal for what a Canonical
> Variant Extension Type for Arrow could be in a google doc[5]. I'm
> hoping that this can spark some discussion and comments on the
> document. If there's relative consensus on it, then I'll work on
> creating some implementations of it that I can use to formally propose
> the addition to the Canonical Extensions.
>
> Please take a read and leave comments on the google doc or on this
> thread. Thanks everyone!
>
> --Matt
>
> [1]: https://github.com/apache/arrow-rs/issues/7063
> [2]: https://github.com/apache/arrow/issues/45937
> [3]: https://github.com/apache/arrow/pull/45375#issuecomment-2649807352
> [4]: https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse
> [5]: https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?usp=sharing
>
