On 12/05/2025 18:20, Matt Topol wrote:
> It's not just Parquet Variant, it's also Iceberg (which has
> standardized on this) and Spark in-memory (where this encoding scheme
> originated).

Ok, but it's called Parquet Variant now, since that's where the binary spec lives:
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
https://github.com/apache/parquet-format/blob/master/VariantShredding.md

> As far as relying on union types, the reason we can't do so is that
> the specific purpose of this Variant type is that we don't know the
> types up front: it's dynamic. Also, Arrow union types do not allow
> arbitrarily recursive structures, which is one of the issues others
> have run into with them. Another consideration is the goal of
> projecting the shredded columns. Since we need to project shredded
> columns, we'd only be able to use sparse unions, not dense ones, and
> while it is well defined how to project the child of a struct, it's
> less defined how to project the child of a union.

Those are good points, thank you.

Regards

Antoine.
