wjones127 commented on issue #42069:
URL: https://github.com/apache/arrow/issues/42069#issuecomment-2166130179
## An Arrow extension type?
In the near term, I think this would make a good Arrow extension type. The
storage type would be:
```
struct<
  metadata: dictionary<binary>,
  data: binary
>
```
The metadata will usually be a single binary value shared across all rows, but
there could be several distinct values. (That might happen if two different
batches are concatenated together, for example.) Either a dictionary-encoded or
a run-end encoded (REE) array would be appropriate.
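To make the concatenation case concrete, here is a toy pure-Python model of a dictionary-encoded metadata column. It is only a sketch and not the Arrow API: `DictColumn` and `concat` are hypothetical names, and the metadata blobs are placeholder bytes rather than real variant metadata.

```python
from dataclasses import dataclass


@dataclass
class DictColumn:
    """Toy dictionary-encoded column: distinct values plus one index per row."""
    values: list   # distinct metadata blobs (bytes)
    indices: list  # per-row position into `values`


def concat(a, b):
    """Concatenate two columns, merging their dictionaries without duplicates."""
    values = list(a.values)
    position = {blob: i for i, blob in enumerate(values)}
    indices = list(a.indices)
    for idx in b.indices:
        blob = b.values[idx]
        if blob not in position:
            position[blob] = len(values)
            values.append(blob)
        indices.append(position[blob])
    return DictColumn(values, indices)


# Placeholder bytes; real variant metadata has its own binary layout.
batch_1 = DictColumn([b"meta-A"], [0, 0, 0])
batch_2 = DictColumn([b"meta-B"], [0, 0])

merged = concat(batch_1, batch_2)
print(merged.values)   # [b'meta-A', b'meta-B']
print(merged.indices)  # [0, 0, 0, 1, 1]
```

Each batch on its own has a one-entry dictionary. The merged column has two entries, and the per-row cost stays at one small index.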
The data could be binary, large binary, or binary view.
Binary view isn’t widely supported right now, but it could be very useful for
this data type, because sub-objects can be sliced out of variants and a view
can reference those bytes in place. From the spec [^1]:
> Another motivation for the representation is that (aside from metadata)
each inner Variant value is contiguous and self-contained. For example, in a
Variant containing an Array of Variant values, the representation of an inner
Variant value, when paired with the metadata of the full variant, is itself a
valid Variant.
[^1]: https://github.com/apache/spark/blob/master/common/variant/README.md
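As a rough illustration of why view-style storage fits here, the sketch below slices one inner value out of a shared buffer without copying it. The buffer contents and the `(offset, length)` pairs are made up for the example and are not the real variant encoding. `slice_inner` is a hypothetical helper, and `memoryview` only stands in for what a binary view entry would do.

```python
# Placeholder buffer: NOT real variant encoding. Pretend it holds an array
# whose three inner values occupy the byte ranges listed in `inner_ranges`.
buffer = bytearray(b"[hdr]" + b"AAAA" + b"BB" + b"CCCCCC")
inner_ranges = [(5, 4), (9, 2), (11, 6)]  # hypothetical (offset, length) pairs


def slice_inner(buf, offset, length):
    """Return a zero-copy view of one inner value inside `buf`."""
    return memoryview(buf)[offset:offset + length]


second = slice_inner(buffer, *inner_ranges[1])
print(bytes(second))         # b'BB'
print(second.obj is buffer)  # True: the view shares the parent's memory

# Changing the parent buffer is visible through the view, so no copy was made.
buffer[9] = ord("Z")
print(bytes(second))         # b'ZB'
```

With plain binary storage, extracting a sub-object into a new array would mean copying its bytes into a new contiguous data buffer. A view only needs to record where the bytes already live.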
## Where could this be useful?
A few immediate places I think this extension type could be useful:
- Roundtrip variant Arrow ↔ Spark
- Spark Connect (and any ADBC connector to that) would benefit from this
- Extension type in PyArrow, roundtrip PySpark ↔ PyArrow
- DataFusion function library (I’m experimenting with that now)
  * There's been substantial interest in the DataFusion community in a way to
    handle semi-structured data efficiently.
## Extension type pitfalls
The main pitfall of using an extension type for this is that the storage type
is meaningless to users. If the data is pulled into a system that doesn't
understand the variant extension type, they need special libraries to
interpret the bytes.
In addition, most existing Arrow systems I've worked with don't have a way
to customize how extension arrays are printed. I think this is something we
should fix. A reasonable workaround in the meantime is providing functions that
convert these back to JSON strings for the purpose of printing.