Allow dictionary-encoded children?

Brian Hulette Fri, 06 Apr 2018 07:43:17 -0700

I've been considering a use-case with a dictionary-encoded struct column, which may contain some dictionary-encoded columns itself. More specifically, in this use-case each row represents a single observation in a geospatial track, which includes a position, a time, and some track-level metadata (track id, origin, destination, etc.). I would like to represent the metadata as a dictionary-encoded struct, since unique values will be repeated for each observation of that track, and I would _also_ like to dictionary-encode some of the metadata column's children, since unique values will typically be repeated in multiple tracks.

I think one could make a (totally legitimate) argument that this is stretching a format designed for tabular data too far. This use-case could also be accomplished by breaking out the struct metadata column into its own Arrow table, and managing a new integer column that references that table. This would look almost identical to what I initially described; it just wouldn't rely on the Arrow libraries to manage the "dictionary".
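
To make that workaround concrete, here is a minimal sketch in plain Python. It deliberately does not use the Arrow libraries; the helper name `dictionary_encode`, the `decode_origin` function, and the sample track values are all hypothetical and only illustrate the two levels of indirection (an integer column referencing a metadata table, plus a dictionary-encoded child of that table).

```python
def dictionary_encode(values):
    """Return (dictionary, indices); dictionary holds unique values in first-seen order."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Hypothetical per-observation metadata: (track id, origin, destination).
observations = [
    ("t1", "SEA", "SFO"),
    ("t1", "SEA", "SFO"),
    ("t2", "SEA", "LAX"),
    ("t2", "SEA", "LAX"),
    ("t3", "PDX", "SFO"),
]

# Outer level: a separate metadata "table" plus an integer column that references it.
metadata_table, metadata_ref = dictionary_encode(observations)

# Inner level: dictionary-encode the origin child of the metadata table.
origin_dictionary, origin_indices = dictionary_encode(
    [origin for (_track_id, origin, _destination) in metadata_table]
)


def decode_origin(row):
    """Resolve the origin for an observation row through both levels of indices."""
    return origin_dictionary[origin_indices[metadata_ref[row]]]
```

Each observation stores only one integer, and each distinct track stores only one integer per encoded child, which is the same saving the nested dictionary encoding would provide.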

The spec doesn't have anything to say on this topic as far as I can tell, but our implementations don't currently allow a dictionary-encoded column's children to be dictionary-encoded themselves [1]. Is this just a simplifying assumption, or a hard rule that should be codified in the spec?


Thanks,
Brian

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824
