Hi Dewey,
That's an interesting finding. Indeed, the IPC serialization of data
types (the Schema table) is currently not able to distinguish between
those two cases, simply because the dictionary type is not represented
separately from its value type.
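To make the ambiguity concrete, here is a minimal Python sketch. It does not use the real Flatbuffers schema: the classes and the to_ipc_field helper are hypothetical stand-ins for a simplified Field record with one type slot, one dictionary-encoding slot and one custom_metadata map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    name: str  # e.g. "int32" or "fixed_size_binary(16)"


@dataclass(frozen=True)
class Dictionary:
    index: object  # index type
    value: object  # value type


@dataclass(frozen=True)
class Extension:
    name: str
    storage: object  # storage type


def to_ipc_field(logical_type):
    """Flatten a nested type into one simplified, IPC-like Field record."""
    field = {"type": None, "dictionary_index": None, "custom_metadata": {}}
    node = logical_type
    while not isinstance(node, Primitive):
        if isinstance(node, Extension):
            # Extension types only leave a trace in the field metadata.
            field["custom_metadata"]["ARROW:extension:name"] = node.name
            node = node.storage
        else:
            # Dictionary encoding only records the index type; the value
            # type takes over the field's single type slot.
            field["dictionary_index"] = node.index.name
            node = node.value
    field["type"] = node.name
    return field


uuid_storage = Primitive("fixed_size_binary(16)")
dict_of_ext = Dictionary(Primitive("int32"), Extension("arrow.uuid", uuid_storage))
ext_of_dict = Extension("arrow.uuid", Dictionary(Primitive("int32"), uuid_storage))

# The two logical types differ, but their flattened records are equal.
print(to_ipc_field(dict_of_ext) == to_ipc_field(ext_of_dict))
```

Both nestings collapse into the same record because nothing in it says whether the extension metadata applies outside or inside the dictionary encoding.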
I think there are two possible ways to improve this:
1. An additional field in the DictionaryEncoding table that allows
specifying custom KeyValue metadata for the dictionary value type.
Pros:
- easy to implement
- degrades gracefully for legacy readers, which will simply deserialize
the storage type.
Cons:
- does not fully solve the general problem for more complex nestings of
dictionary and extension types (e.g. an extension type with a dictionary
storage type with extension values).
2. A new Dictionary table that participates in the Type union, where the
dictionary index type would be serialized in Field::children[0] and the
value type in Field::children[1].
Pros:
- fully general, as it can represent arbitrary nestings of
dictionary and extension types.
Cons:
- implementation is more involved
- legacy readers will not understand this and error out on the
unrecognized type
- writers will have to decide whether to use the new or the old way of
representing dictionaries (the old way being preferable for compatibility).
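As a rough illustration of option 2, the Python sketch below gives the dictionary its own type node with two children. The tuple encoding of types, the to_field helper and the "Dictionary" type tag are all hypothetical; this is not a proposed wire format, and it ignores the case of an extension wrapping another extension.

```python
def to_field(node):
    """Serialize a nested type into a hypothetical option-2 Field record."""
    if isinstance(node, str):  # primitive type, e.g. "int32"
        return {"type": node, "children": [], "custom_metadata": {}}
    kind = node[0]
    if kind == "dictionary":
        _, index, value = node
        return {
            "type": "Dictionary",
            # children[0] is the index type, children[1] the value type
            "children": [to_field(index), to_field(value)],
            "custom_metadata": {},
        }
    if kind == "extension":
        _, name, storage = node
        field = to_field(storage)
        # The extension name is attached to the field it actually wraps.
        field["custom_metadata"] = {"ARROW:extension:name": name}
        return field
    raise ValueError(f"unknown type node: {node!r}")


dict_of_ext = ("dictionary", "int32",
               ("extension", "arrow.uuid", "fixed_size_binary(16)"))
ext_of_dict = ("extension", "arrow.uuid",
               ("dictionary", "int32", "fixed_size_binary(16)"))

# The extension metadata now sits at a different nesting level in each case.
print(to_field(dict_of_ext) != to_field(ext_of_dict))
```

In this sketch the extension metadata lands on children[1] for a dictionary of extension values and on the top-level field for an extension with dictionary storage, so a reader can tell them apart.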
I would say we probably don't need 2) and can live with 1). But, of
course, perhaps in 5 years we will regret this decision :-D
Regards
Antoine.
On 10/04/2026 at 17:16, Dewey Dunnington wrote:
Hi all,
In implementing dictionary decoding for nanoarrow's IPC reader [1] I
discovered that it is not possible to represent a dictionary-encoded
extension type in the IPC schema serialization. I've filed an issue
with the details at [2]. In short, a Dictionary with
Extension values is exported identically to an Extension with
Dictionary storage, which usually leads to an error on read (because
no extension types actually support dictionary storage types, except
maybe arrow.opaque because it can have arbitrary storage). I was also
reminded that arrow-rs can't represent dictionary-encoded extension
values at all [3].
Given that there are a number of canonical extension types now, I
wonder if there should be a clearer route to roundtripping
dictionary-encoded extension types over IPC (either by making this
possible to represent in IPC or by making it clear that extension type
implementations must handle dictionary-encoded storage). A middle
ground would be to handle the error on deserialization (i.e., if
the extension type in the registry doesn't support dictionary-encoded
storage, fall back to a dictionary with extension values).
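The fallback idea could be sketched roughly as follows in Python. The resolve function, the field record layout and the registry (a plain mapping from extension name to whether it accepts dictionary storage) are all hypothetical and do not correspond to any existing Arrow API; the registry entries are illustrative only.

```python
def resolve(field, registry):
    """Pick an interpretation for a field that may be ambiguous on read."""
    name = field["custom_metadata"].get("ARROW:extension:name")
    if name is None or field["dictionary_index"] is None:
        return "unambiguous"
    # Assumption: unregistered names are treated like extension types that
    # do not accept dictionary storage.
    if registry.get(name, False):
        return "extension<dictionary>"
    # Fall back instead of raising an error on read.
    return "dictionary<extension>"


# Illustrative registry: extension name -> accepts dictionary storage?
registry = {"arrow.uuid": False, "arrow.opaque": True}

uuid_field = {
    "type": "fixed_size_binary(16)",
    "dictionary_index": "int32",
    "custom_metadata": {"ARROW:extension:name": "arrow.uuid"},
}
opaque_field = dict(
    uuid_field, custom_metadata={"ARROW:extension:name": "arrow.opaque"}
)

print(resolve(uuid_field, registry))
print(resolve(opaque_field, registry))
```

The result depends on what each reader has registered, so two readers with different registries could interpret the same field differently.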
Cheers,
-dewey
[1] https://github.com/apache/arrow-nanoarrow/pull/861
[2] https://github.com/apache/arrow/issues/49704
[3] https://github.com/apache/arrow-rs/issues/7982