It also seems that two variations of the variant encoding are being
discussed.  The original spec, currently housed in Spark, creates a variant
array in row-major order, that is, each element in the array is stored
contiguously.  So, if you have objects like `{"a": 7, "b": 3}`, then the
values for `a` and `b` will be co-located.
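
A toy sketch of the row-major idea in Python (JSON here is just a
stand-in for the actual variant binary encoding):

    import json

    rows = [{"a": 7, "b": 3}, {"a": 1, "b": 2}]

    # One opaque blob per row; the "a" and "b" values for a given
    # row sit next to each other in memory.
    row_major = [json.dumps(r).encode("utf-8") for r in rows]
    # [b'{"a": 7, "b": 3}', b'{"a": 1, "b": 2}']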

There is also a shredded variant which, as I understand it, is not yet
fully designed, where a single array is stored in multiple buffers, one per
field.  This provides better compression and faster field extraction,
and is more favorable to the Parquet crowd.
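
And the same two rows shredded, one buffer per field (a hypothetical
layout, since the shredding spec isn't finalized; plain lists stand in
for real Arrow buffers):

    rows = [{"a": 7, "b": 3}, {"a": 1, "b": 2}]

    # One buffer per field: extracting "a" touches only a's buffer,
    # and each buffer compresses well on its own.
    shredded = {
        "a": [r.get("a") for r in rows],
        "b": [r.get("b") for r in rows],
    }
    # {'a': [7, 1], 'b': [3, 2]}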

I could see an argument for supporting both variants in Arrow (along with
the ability to convert between them).


On Thu, Aug 22, 2024 at 8:49 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Le 22/08/2024 à 17:08, Curt Hagenlocher a écrit :
> >
> > (I also happen to want a canonical Arrow representation for variant data,
> > as this type occurs in many databases but doesn't have a great
> > representation today in ADBC results. That's why I filed [Format]
> Consider
> > adding an official variant type to Arrow · Issue #42069 · apache/arrow
> > (github.com) <https://github.com/apache/arrow/issues/42069>. Of course,
> > there's no specific reason why a canonical Arrow representation for
> > variants must align with Spark and/or Iceberg.)
>
> Well, one reason is interoperability, especially as converting between
> different semi-structured encodings like this would probably be expensive.
>
> Regards
>
> Antoine.
>
