It also seems that two variations of the variant encoding are being discussed. The original spec, currently housed in Spark, creates a variant array in row-major order; that is, each element of the array is stored contiguously. So if you have objects like `{"a": 7, "b": 3}`, the values for `a` and `b` will be co-located.
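To make that concrete, here is a minimal sketch using pyarrow. The JSON bytes are just a stand-in for the actual Spark variant binary encoding; the point is only that each row's entire value lands contiguously in one buffer slot:

```python
import json
import pyarrow as pa

# Two example rows; JSON bytes stand in for the real variant encoding.
rows = [{"a": 7, "b": 3}, {"a": 1, "b": 2}]

# Row-major layout: one opaque blob per element, so the values for "a"
# and "b" within a given row sit next to each other in the data buffer.
row_major = pa.array([json.dumps(r).encode() for r in rows],
                     type=pa.binary())
print(row_major)
```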
There is also a shredded variant which, as I understand it, is not yet fully designed, where a single array is stored in multiple buffers, one per field. This provides better compression and faster field extraction, and is more favorable to the Parquet crowd. I could see an argument for supporting both variants in Arrow, along with the ability to convert between them. (A rough sketch of the shredded layout is at the bottom of this message.)

On Thu, Aug 22, 2024 at 8:49 AM Antoine Pitrou <anto...@python.org> wrote:
>
> On 22/08/2024 at 17:08, Curt Hagenlocher wrote:
> >
> > (I also happen to want a canonical Arrow representation for variant
> > data, as this type occurs in many databases but doesn't have a great
> > representation today in ADBC results. That's why I filed "[Format]
> > Consider adding an official variant type to Arrow" · Issue #42069 ·
> > apache/arrow <https://github.com/apache/arrow/issues/42069>. Of course,
> > there's no specific reason why a canonical Arrow representation for
> > variants must align with Spark and/or Iceberg.)
>
> Well, one reason is interoperability, especially as converting between
> different semi-structured encodings like this would probably be
> expensive.
>
> Regards
>
> Antoine.
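For contrast, here is the shredded layout I have in mind, again only a sketch: since the shredded spec isn't finalized, the per-field buffers are shown as ordinary Arrow columns with made-up names and types, and this ignores how fields that don't fit the shredded schema would be carried along.

```python
import pyarrow as pa

# The same two rows as above, shredded: one buffer per field instead of
# one blob per row. Field names and types are illustrative only.
shredded = pa.table({
    "a": pa.array([7, 1], type=pa.int64()),
    "b": pa.array([3, 2], type=pa.int64()),
})
print(shredded)
```

Extracting field `a` then touches only its own buffer, which is where the better compression and faster field extraction come from.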