Hi Dewey,

That's an interesting finding. Indeed, the IPC serialization of data types (the Schema table) is currently not able to distinguish between those two cases, simply because the dictionary type is not represented separately from its value type.
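
A minimal sketch of the ambiguity, using a toy model rather than any real Arrow API. The `Dictionary`, `Extension`, `IpcField` and `to_ipc_field` names and the `example.ext` extension name are hypothetical. The only detail taken from the IPC format is that the extension name lives in the field's custom metadata under `ARROW:extension:name`, and that the field's type slot holds the dictionary value type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Dictionary:
    index: str      # index type, e.g. "int32"
    value: object   # value type (a str for primitives, or a nested type)


@dataclass(frozen=True)
class Extension:
    name: str
    storage: object  # storage type (a str for primitives, or a nested type)


@dataclass
class IpcField:
    # Toy stand-in for the flat Field layout: one type slot, one optional
    # dictionary encoding, one metadata map.
    type: str = ""
    dictionary_index: Optional[str] = None
    custom_metadata: dict = field(default_factory=dict)


def to_ipc_field(dtype) -> IpcField:
    """Flatten a nested type into the single-level toy Field."""
    out = IpcField()
    while True:
        if isinstance(dtype, Extension):
            out.custom_metadata["ARROW:extension:name"] = dtype.name
            dtype = dtype.storage
        elif isinstance(dtype, Dictionary):
            out.dictionary_index = dtype.index
            dtype = dtype.value
        else:
            out.type = dtype
            return out


dict_of_ext = to_ipc_field(Dictionary("int32", Extension("example.ext", "utf8")))
ext_of_dict = to_ipc_field(Extension("example.ext", Dictionary("int32", "utf8")))

# Both nestings collapse to the same flat description.
print(dict_of_ext == ext_of_dict)
```

Because the nesting order is lost during flattening, a reader has no way to tell which of the two types the writer meant.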

I think there are two possible ways to improve this:


1. An additional field in the DictionaryEncoding table that allows specifying custom KeyValue metadata for the dictionary value type.

Pros:
- easy to implement
- degrades gracefully for legacy readers, which will happily deserialize the storage type.

Cons:
- does not fully solve the general problem for more complex nestings of dictionary and extension types (e.g. an extension type with a dictionary storage type with extension values).
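
A rough sketch of how option 1 could look, again as a toy model and not the actual flatbuffers schema. The `value_metadata` field name is an assumption (the proposal does not name the field), as are `legacy_view` and the `example.ext` extension name.

```python
from dataclasses import dataclass, field

EXT_KEY = "ARROW:extension:name"


@dataclass
class DictionaryEncoding:
    index_type: str
    # Hypothetical new field: KeyValue metadata for the dictionary value type.
    value_metadata: dict = field(default_factory=dict)


@dataclass
class Field:
    type: str
    dictionary: DictionaryEncoding
    custom_metadata: dict = field(default_factory=dict)


# Dictionary with extension values: extension metadata sits on the encoding.
dict_of_ext = Field("utf8", DictionaryEncoding("int32", {EXT_KEY: "example.ext"}))
# Extension with dictionary storage: extension metadata stays on the field.
ext_of_dict = Field("utf8", DictionaryEncoding("int32"), {EXT_KEY: "example.ext"})


def legacy_view(f: Field) -> str:
    """What a reader that ignores value_metadata would reconstruct."""
    inner = f"dictionary<{f.dictionary.index_type}, {f.type}>"
    name = f.custom_metadata.get(EXT_KEY)
    return f"extension<{name}, {inner}>" if name else inner


print(dict_of_ext != ext_of_dict)
print(legacy_view(dict_of_ext))
```

In this model the two cases no longer compare equal, and a reader that does not know about the new field falls back to a plain dictionary of the storage type.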


2. A new Dictionary table that participates in the Type union, where the dictionary index type would be serialized in Field::children[0] and the value type in Field::children[1].

Pros:
- fully general, as it can represent arbitrary nestings of dictionary and extension types.

Cons:
- implementation is more involved
- legacy readers will not understand this and will error out on the unrecognized type
- writers will have to decide whether to use the new or the old way of representing dictionaries (the old way being preferable for compatibility).
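
A rough sketch of the option 2 layout, as a toy model only. The `"dictionary"` type tag, the `describe` helper and the `example.ext` extension name are hypothetical; the only part taken from the proposal is that the index type goes in `children[0]` and the value type in `children[1]`.

```python
from dataclasses import dataclass, field

EXT_KEY = "ARROW:extension:name"


@dataclass
class Field:
    type: str
    children: list = field(default_factory=list)
    custom_metadata: dict = field(default_factory=dict)


def describe(f: Field) -> str:
    """Rebuild a readable type string from the nested toy Field."""
    if f.type == "dictionary":
        index, value = f.children  # children[0] = index, children[1] = value
        text = f"dictionary<{describe(index)}, {describe(value)}>"
    else:
        text = f.type
    name = f.custom_metadata.get(EXT_KEY)
    return f"extension<{name}, {text}>" if name else text


# Dictionary with extension values: metadata sits on the value child.
dict_of_ext = Field(
    "dictionary",
    [Field("int32"), Field("utf8", custom_metadata={EXT_KEY: "example.ext"})],
)
# Extension with dictionary storage: metadata sits on the dictionary field.
ext_of_dict = Field(
    "dictionary",
    [Field("int32"), Field("utf8")],
    {EXT_KEY: "example.ext"},
)

print(describe(dict_of_ext))
print(describe(ext_of_dict))
```

Since every level of nesting has its own field and its own metadata map, each extension annotation stays attached to the type it describes.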


I would say we probably don't need 2) and can live with 1). But, of course, perhaps in 5 years we will regret this decision :-D

Regards

Antoine.


On 10/04/2026 at 17:16, Dewey Dunnington wrote:
Hi all,

In implementing dictionary decoding for nanoarrow's IPC reader [1] I
discovered that it is not possible to represent a dictionary-encoded
extension type in the IPC schema serialization. I've filed an issue
with the details at [2]...the summary is that a Dictionary with
Extension values is exported identically to an Extension with
Dictionary storage, which usually leads to an error on read (because
no extension types actually support dictionary storage types, except
maybe arrow.opaque because it can have arbitrary storage). I was also
reminded that arrow-rs can't represent dictionary-encoded extension
values at all [3].

Given that there are a number of canonical extension types now, I
wonder if there should be a clearer route to roundtripping
dictionary-encoded extension types over IPC (either by making this
possible to represent in IPC or by making it clear that extension type
implementations must handle dictionary-encoded storage). Somewhere in
the middle would be handling the error on deserialization (i.e., if
the extension type in the registry doesn't support dictionary-encoded
storage, fall back to a dictionary with extension values).

Cheers,

-dewey

[1] https://github.com/apache/arrow-nanoarrow/pull/861
[2] https://github.com/apache/arrow/issues/49704
[3] https://github.com/apache/arrow-rs/issues/7982
