Hi all,

In implementing dictionary decoding for nanoarrow's IPC reader [1] I
discovered that it is not possible to represent a dictionary-encoded
extension type in the IPC schema serialization. I've filed an issue
with the details at [2]...the summary is that a Dictionary with
Extension values is exported identically to an Extension with
Dictionary storage, which usually leads to an error on read (because
no extension types actually support dictionary storage types, except
maybe arrow.opaque because it can have arbitrary storage). I was also
reminded that arrow-rs can't represent dictionary-encoded extension
values at all [3].
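
To make the collision concrete, here is a minimal sketch in plain Python.
It is a simplified model and does not use any Arrow library: it assumes
only that an IPC Field carries the storage type, an optional dictionary
encoding, and the extension name under the "ARROW:extension:name"
metadata key. The classes, the to_ipc_field() helper, and the
"example.label" extension name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Primitive:
    name: str


@dataclass(frozen=True)
class Dictionary:
    index: str            # index type name, e.g. "int32"
    value: "DataType"     # value type (may itself be an Extension)


@dataclass(frozen=True)
class Extension:
    name: str
    storage: "DataType"   # storage type (may itself be a Dictionary)


DataType = Union[Primitive, Dictionary, Extension]


@dataclass(frozen=True)
class IpcField:
    """Simplified stand-in for an IPC Field: the nesting order is not recorded."""
    storage_type: str
    index_type: Optional[str]
    metadata: Tuple[Tuple[str, str], ...]


def to_ipc_field(dtype: DataType) -> IpcField:
    """Flatten a type into the three slots the simplified Field offers."""
    metadata = {}
    index_type = None
    # Peel off layers until only the innermost storage type remains.
    while not isinstance(dtype, Primitive):
        if isinstance(dtype, Extension):
            metadata["ARROW:extension:name"] = dtype.name
            dtype = dtype.storage
        else:
            index_type = dtype.index
            dtype = dtype.value
    return IpcField(dtype.name, index_type, tuple(sorted(metadata.items())))


dict_of_ext = Dictionary("int32", Extension("example.label", Primitive("utf8")))
ext_of_dict = Extension("example.label", Dictionary("int32", Primitive("utf8")))

# Two different in-memory types, one serialized form.
assert dict_of_ext != ext_of_dict
assert to_ipc_field(dict_of_ext) == to_ipc_field(ext_of_dict)
```

Because both nestings flatten to the same field, a reader has nothing in
the schema itself to tell it which one the writer meant.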

Given that there are a number of canonical extension types now, I
wonder if there should be a clearer route to roundtripping
dictionary-encoded extension types over IPC (either by making this
possible to represent in IPC or by making it clear that extension type
implementations must handle dictionary-encoded storage). Somewhere in
the middle would be handling the error on deserialization (i.e., if
the extension type in the registry doesn't support dictionary-encoded
storage, fall back to a dictionary with extension values).
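
A rough sketch of that middle option, again in plain Python rather than
any real reader's API. The REGISTRY of storage predicates, the
resolve_field() helper, the string type descriptions, and the
"example.*" extension names are all hypothetical; the sketch also
assumes that an unregistered extension is simply read as its storage
type.

```python
from typing import Callable, Dict

# Hypothetical registry: extension name -> "do I accept this storage type?"
REGISTRY: Dict[str, Callable[[str], bool]] = {
    # Accepts only plain utf8 storage.
    "example.label": lambda storage: storage == "utf8",
    # Accepts arbitrary storage.
    "example.anything": lambda storage: True,
}


def resolve_field(ext_name: str, index_type: str, value_type: str) -> str:
    """Choose a nesting for a field that is both dictionary-encoded and
    tagged with an extension name."""
    dict_storage = f"dictionary<{index_type}, {value_type}>"
    accepts = REGISTRY.get(ext_name)
    if accepts is None:
        # Assumption: unregistered extensions are read as their storage type.
        return dict_storage
    if accepts(dict_storage):
        # The extension type can wrap dictionary-encoded storage directly.
        return f"extension<{ext_name}>[{dict_storage}]"
    if accepts(value_type):
        # Fallback: a dictionary whose values are the extension type.
        return f"dictionary<{index_type}, extension<{ext_name}>[{value_type}]>"
    raise TypeError(f"{ext_name} accepts neither {dict_storage} nor {value_type}")


print(resolve_field("example.label", "int32", "utf8"))
print(resolve_field("example.anything", "int32", "utf8"))
```

Note that a type accepting arbitrary storage always resolves to the
extension-with-dictionary-storage reading here, so the fallback alone
does not remove the ambiguity for that case.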

Cheers,

-dewey

[1] https://github.com/apache/arrow-nanoarrow/pull/861
[2] https://github.com/apache/arrow/issues/49704
[3] https://github.com/apache/arrow-rs/issues/7982
