Thanks Uwe, Wes, glad to hear I'm not too far out there :) The
dictionary batch ordering seems like a reasonable requirement for this
I made a JIRA to add something like this to the integration tests
(https://issues.apache.org/jira/browse/ARROW-2412) and Ill put up a PR
On 04/06/2018 01:43 PM, Wes McKinney wrote:
Having dictionaries-within-dictionaries does add some complexity, but
I think the use case is valid and so it would be good to determine the
best way to handle this in the IPC / messaging protocol.
I would suggest: dictionaries can use other dictionaries, so long as
those dictionaries occur earlier in the stream. I am not sure either
the Java or C++ libraries will be able to properly handle these cases
right now, but that's what we have integration tests for!
On Fri, Apr 6, 2018 at 11:59 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
I would also have considered this a legitimate use of the Arrow specification.
We only specify the DictionaryType to have a dictionary of any Arrow Type. In
the context of Arrow's IPC this seems to be a bit more complicated as we seem
to have the assumption that there is only one type of Dictionary per column. I
would argue that we should be able to support this once we work out a reliable
way to transfer them via the IPC mechanism.
Just as a related thought (might not produce the result you want): In Parquet,
only the values on the lowest level are dictionary-encoded. But this is also
due to the fact that Parquet uses repetition and definition levels to encode
arbitrarily nested data types. These are more space-efficient when they are
correctly encoded but don't provide random access.
On Fri, Apr 6, 2018, at 4:42 PM, Brian Hulette wrote:
I've been considering a use-case with a dictionary-encoded struct
column, which may contain some dictionary-encoded columns itself. More
specifically, in this use-case each row represents a single observation
in a geospatial track, which includes a position, a time, and some
track-level metadata (track id, origin, destination, etc...). I would
like to represent the metadata as a dictionary-encoded struct, since
unique values will be repeated for each observation of that track, and I
would _also_ like to dictionary-encode some of the metadata column's
children, since unique values will typically be repeated in multiple tracks.
I think one could make a (totally legitimate) argument that this is
stretching a format designed for tabular data too far. This use-case
could also be accomplished by breaking out the struct metadata column
into its own arrow table, and managing a new integer column that
references that table. This would look almost identical to what I
initially described, it just wouldn't rely on the arrow libraries to
manage the "dictionary".
The spec doesn't have anything to say on this topic as far as I can
tell, but our implementations don't currently allow a dictionary-encoded
column's children to be dictionary-encoded themselves . Is this just
a simplifying assumption, or a hard rule that should be codified in the