No objections from me!

On Tue, Apr 14, 2026 at 11:08 AM Dewey Dunnington <[email protected]> wrote:
>
> Hi Antoine,
>
> Option 1 seems reasonable to me...given that we got 10 years out of
> the current specification without anybody noticing, I bet we can get
> another 10 out of KeyValueMetadata on the dictionary encoding :)
>
> If there are no objections I can put together a PR for this (may take
> me a few weeks).
>
> Cheers,
>
> -dewey
>
> On Tue, Apr 14, 2026 at 2:53 AM Antoine Pitrou <[email protected]> wrote:
> >
> >
> > Hi Dewey,
> >
> > That's an interesting finding. Indeed, the IPC serialization of data
> > types (the Schema table) is currently not able to distinguish between
> > those two cases, simply because the dictionary type is not represented
> > separately from its value type.
> >
> > I think there are two possible ways to improve this:
> >
> >
> > 1. An additional field in the DictionaryEncoding table that allows
> > specifying custom KeyValue metadata for the dictionary value type.
> >
> > Pros:
> > - easy to implement
> > - gracefully degrades to legacy readers that will happily deserialize
> >   the storage type.
> >
> > Cons:
> > - does not fully solve the general problem for more complex nestings of
> >   dictionary and extension types (e.g. an extension type with a dictionary
> >   storage type with extension values).
> >
> >
> > 2. A new Dictionary table that participates in the Type union, where the
> > dictionary index type would be serialized in Field::children[0] and the
> > value type in Field::children[1].
> >
> > Pros:
> > - fully general, as it allows representing arbitrary nestings of
> >   dictionary and extension types.
> >
> > Cons:
> > - implementation is more involved
> > - legacy readers will not understand this and error out on the
> >   unrecognized type
> > - writers will have to decide whether to use the new or the old way of
> >   representing dictionaries (the old way being preferable for compatibility).
> >
> >
> > I would say we probably don't need 2) and can live with 1). But, of
> > course, perhaps in 5 years we will regret this decision :-D
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Apr 10, 2026 at 17:16, Dewey Dunnington wrote:
> > > Hi all,
> > >
> > > In implementing dictionary decoding for nanoarrow's IPC reader [1] I
> > > discovered that it is not possible to represent a dictionary-encoded
> > > extension type in the IPC schema serialization. I've filed an issue
> > > with the details at [2]...the summary is that a Dictionary with
> > > Extension values is exported identically to an Extension with
> > > Dictionary storage, which usually leads to an error on read (because
> > > no extension types actually support dictionary storage types, except
> > > maybe arrow.opaque because it can have arbitrary storage). I was also
> > > reminded that arrow-rs can't represent dictionary-encoded extension
> > > values at all [3].
> > >
> > > Given that there are a number of canonical extension types now, I
> > > wonder if there should be a more clear route to roundtripping
> > > dictionary-encoded extension types over IPC (either by making this
> > > possible to represent in IPC or by making it clear that extension type
> > > implementations must handle dictionary encoded storage). Somewhere in
> > > the middle would be handling the error on deserialization (i.e., if
> > > the extension type in the registry doesn't support dictionary encoded
> > > storage, fall back to a dictionary with extension values).
> > >
> > > Cheers,
> > >
> > > -dewey
> > >
> > > [1] https://github.com/apache/arrow-nanoarrow/pull/861
> > > [2] https://github.com/apache/arrow/issues/49704
> > > [3] https://github.com/apache/arrow-rs/issues/7982
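
For anyone skimming the thread, the ambiguity and the effect of option 1 can be sketched with a toy model in Python. This is only an illustration, not the Arrow Flatbuffers schema or any Arrow library API: ToyField, ToyDictionaryEncoding, the serialize_* helpers and the value_metadata attribute are hypothetical names made up here. It assumes a field records the storage (value) type, an optional dictionary encoding, and extension information as key/value metadata on the field, and that option 1 would add a separate metadata slot on the dictionary encoding.

```python
from dataclasses import dataclass
from typing import Optional

EXT_KEY = "ARROW:extension:name"


@dataclass(frozen=True)
class ToyDictionaryEncoding:
    index_type: str
    # Hypothetical stand-in for the extra metadata field proposed in option 1.
    value_metadata: tuple = ()


@dataclass(frozen=True)
class ToyField:
    value_type: str  # storage type of the values
    dictionary: Optional[ToyDictionaryEncoding] = None
    custom_metadata: tuple = ()


def serialize_dict_of_ext(index_type, ext_name, storage_type):
    """Dictionary with extension values, in the toy 'current' layout."""
    return ToyField(
        storage_type,
        ToyDictionaryEncoding(index_type),
        ((EXT_KEY, ext_name),),
    )


def serialize_ext_of_dict(index_type, ext_name, storage_type):
    """Extension with dictionary storage, in the toy 'current' layout."""
    return ToyField(
        storage_type,
        ToyDictionaryEncoding(index_type),
        ((EXT_KEY, ext_name),),
    )


def serialize_dict_of_ext_option1(index_type, ext_name, storage_type):
    """Dictionary with extension values, with the extension metadata moved
    onto the dictionary encoding (toy version of option 1)."""
    return ToyField(
        storage_type,
        ToyDictionaryEncoding(index_type, ((EXT_KEY, ext_name),)),
    )


args = ("int32", "arrow.uuid", "fixed_size_binary(16)")
a = serialize_dict_of_ext(*args)
b = serialize_ext_of_dict(*args)
c = serialize_dict_of_ext_option1(*args)

ambiguous = a == b       # both cases collapse to the same toy field
disambiguated = c != b   # option 1 keeps them apart
legacy_view = c.value_type  # what a reader ignoring value_metadata still sees
```

In this sketch a reader that does not know about value_metadata still sees a plain dictionary over the storage type, which corresponds to the "gracefully degrades" point listed under option 1. The deeper nestings mentioned under its cons are not modeled.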
