No objections from me!

On Tue, Apr 14, 2026 at 11:08 AM Dewey Dunnington
<[email protected]> wrote:
>
> Hi Antoine,
>
> Option 1 seems reasonable to me...given that we got 10 years out of
> the current specification without anybody noticing, I bet we can get
> another 10 out of KeyValueMetadata on the dictionary encoding :)
>
> If there are no objections I can put together a PR for this (may take
> me a few weeks).
>
> Cheers,
>
> -dewey
>
> On Tue, Apr 14, 2026 at 2:53 AM Antoine Pitrou <[email protected]> wrote:
> >
> >
> > Hi Dewey,
> >
> > That's an interesting finding. Indeed, the IPC serialization of data
> > types (the Schema table) is currently not able to distinguish between
> > those two cases, simply because the dictionary type is not represented
> > separately from its value type.
> >
> > I think there are two possible ways to improve this:
> >
> >
> > 1. An additional field in the DictionaryEncoding table that allows
> > specifying custom KeyValue metadata for the dictionary value type.
> >
> > Pros:
> > - easy to implement
> > - gracefully degrades to legacy readers that will happily deserialize
> > the storage type.
> >
> > Cons:
> > - does not fully solve the general problem for more complex nestings of
> > dictionary and extension types (e.g. an extension type with a dictionary
> > storage type with extension values).
> >
> >
> > 2. A new Dictionary table that participates in the Type union, where the
> > dictionary index type would be serialized in Field::children[0] and the
> > value type in Field::children[1].
> >
> > Pros:
> > - fully general, as it can represent arbitrary nestings of
> > dictionary and extension types.
> >
> > Cons:
> > - implementation is more involved
> > - legacy readers will not understand this and error out on the
> > unrecognized type
> > - writers will have to decide whether to use the new or the old way of
> > representing dictionaries (the old way being preferable for compatibility).
> >
> >
> > I would say we probably don't need 2) and can live with 1). But, of
> > course, perhaps in 5 years we will regret this decision :-D
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 10/04/2026 at 17:16, Dewey Dunnington wrote:
> > > Hi all,
> > >
> > > In implementing dictionary decoding for nanoarrow's IPC reader [1] I
> > > discovered that it is not possible to represent a dictionary-encoded
> > > extension type in the IPC schema serialization. I've filed an issue
> > > with the details at [2]...the summary is that a Dictionary with
> > > Extension values is exported identically to an Extension with
> > > Dictionary storage, which usually leads to an error on read (because
> > > no extension types actually support dictionary storage types, except
> > > maybe arrow.opaque because it can have arbitrary storage). I was also
> > > reminded that arrow-rs can't represent dictionary-encoded extension
> > > values at all [3].
> > >
> > > Given that there are a number of canonical extension types now, I
> > > wonder if there should be a clearer route to roundtripping
> > > dictionary-encoded extension types over IPC (either by making this
> > > possible to represent in IPC or by making it clear that extension type
> > > implementations must handle dictionary encoded storage). Somewhere in
> > > the middle would be handling the error on deserialization (i.e., if
> > > the extension type in the registry doesn't support dictionary encoded
> > > storage, fall back to a dictionary with extension values).
> > >
> > > Cheers,
> > >
> > > -dewey
> > >
> > > [1] https://github.com/apache/arrow-nanoarrow/pull/861
> > > [2] https://github.com/apache/arrow/issues/49704
> > > [3] https://github.com/apache/arrow-rs/issues/7982
> >
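The ambiguity discussed in this thread, and how Option 1 might resolve it, can be sketched with a toy model in Python. This is only an illustration under stated assumptions, not Arrow's IPC API: `SerializedField`, `DictionaryEncoding.value_metadata`, and the helper functions are hypothetical names, and types are plain strings. The only real identifiers are the `ARROW:extension:name` metadata key and the canonical `arrow.uuid` extension type with its 16-byte fixed-size binary storage.

```python
from dataclasses import dataclass
from typing import Optional

# Real Arrow metadata key used to mark a field as an extension type.
EXT_NAME_KEY = "ARROW:extension:name"


@dataclass(frozen=True)
class DictionaryEncoding:
    index_type: str
    # HYPOTHETICAL Option 1 addition: key/value metadata that applies to the
    # dictionary *value* type rather than to the field as a whole.
    value_metadata: tuple = ()


@dataclass(frozen=True)
class SerializedField:
    # Toy stand-in for the IPC Field table: the type slot holds the
    # dictionary value (storage) type, not a separate dictionary type.
    value_type: str
    dictionary: Optional[DictionaryEncoding]
    custom_metadata: tuple = ()


def dictionary_of_extension(ext_name, storage, index_type, option1=False):
    """Dictionary whose values are an extension type."""
    ext = ((EXT_NAME_KEY, ext_name),)
    if option1:
        # Option 1: extension metadata travels with the dictionary encoding.
        return SerializedField(storage, DictionaryEncoding(index_type, ext))
    # Current format: extension metadata can only go on the field itself.
    return SerializedField(storage, DictionaryEncoding(index_type), ext)


def extension_of_dictionary(ext_name, storage, index_type):
    """Extension type whose storage is dictionary-encoded."""
    ext = ((EXT_NAME_KEY, ext_name),)
    return SerializedField(storage, DictionaryEncoding(index_type), ext)


def legacy_view(f):
    """What a reader unaware of the hypothetical field would see."""
    return (f.value_type, f.dictionary.index_type, f.custom_metadata)


a = dictionary_of_extension("arrow.uuid", "fixed_size_binary(16)", "int32")
b = extension_of_dictionary("arrow.uuid", "fixed_size_binary(16)", "int32")
c = dictionary_of_extension("arrow.uuid", "fixed_size_binary(16)", "int32",
                            option1=True)

same_today = (a == b)          # the two logical types collapse to one field
distinct_with_option1 = (c != b)
degraded = legacy_view(c)      # legacy reader sees plain dictionary storage
```

In this model the two logical types are indistinguishable in the current format, become distinct once the hypothetical `value_metadata` is populated, and a reader that ignores the new field still deserializes a dictionary of the storage type, which matches the graceful degradation listed under Option 1.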
