[GitHub] [arrow-rs] alamb commented on issue #3389: Support DictionaryArrays in Arrow Flight

GitBox Mon, 02 Jan 2023 03:27:08 -0800


alamb commented on issue #3389:
URL: https://github.com/apache/arrow-rs/issues/3389#issuecomment-1368864307

Here is my high level plan.

Reference:
https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc

1. Add an option to `FlightIpcEncoder` to "Assign dictionary ids based on
pointers", defaulting to true
2. If this option is active, the encoder will keep a mapping from actual
pointer of the dictionary to ids
3. The encoder will check any new dictionary arrays encountered for an
existing entry in the map
4. If the entry already exists, the dictionary batch that is transmitted
will use the id from the pre-existing entry in the map
5. If the entry does not exist, the dictionary will be transmitted and the
entry sent with the batch
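
To make steps 2 to 5 concrete, here is a minimal sketch of the pointer-to-id bookkeeping. It is not the arrow-rs implementation: `DictValues`, `DictAction` and `PointerDictTracker` are hypothetical names, and an `Arc<Vec<String>>` stands in for the dictionary's values array so that the example needs only the standard library.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a dictionary's values array.
type DictValues = Arc<Vec<String>>;

/// What the encoder would do for one dictionary array it encounters.
#[derive(Debug, PartialEq)]
enum DictAction {
    /// Pointer already in the map: reuse the existing id.
    Reuse(i64),
    /// Pointer not in the map: assign a new id and transmit the dictionary.
    Send(i64),
}

/// Hypothetical tracker mapping dictionary pointers to ids.
#[derive(Default)]
struct PointerDictTracker {
    /// Address of the values allocation -> assigned dictionary id.
    ids: HashMap<usize, i64>,
    /// Keeps every tracked dictionary alive so that its address cannot be
    /// freed and later reused by a different dictionary.
    retained: Vec<DictValues>,
    next_id: i64,
}

impl PointerDictTracker {
    fn observe(&mut self, values: &DictValues) -> DictAction {
        // Identity is the allocation address, not the contents.
        let key = Arc::as_ptr(values) as usize;
        if let Some(&id) = self.ids.get(&key) {
            return DictAction::Reuse(id);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(key, id);
        self.retained.push(Arc::clone(values));
        DictAction::Send(id)
    }
}
```

Two clones of the same `Arc` map to the same id, while two separately allocated dictionaries with identical contents get different ids. Retaining the dictionaries is one assumed way to keep addresses stable for the lifetime of the tracker; a real encoder might choose a different strategy.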

There is also the use case where arrays have the same logical
dictionary but the contents are in different actual arrays. While I thought
about adding a feature to directly "normalize" these dictionaries by comparing
their values, I would like to avoid this for the first version if possible
because:

1. It will be non-trivially expensive
2. It can be done as a pre-pass over the record batches for systems that
want to trade off additional CPU time for less network bandwidth
3. I can imagine users might want to do other normalizing operations (like
combine / coalesce dictionaries) prior to transmission, but that would be more
system specific
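
As an illustration of the pre-pass mentioned in point 2, the sketch below makes logically equal dictionaries share one allocation, so that a pointer-based encoder would treat them as the same dictionary. The function name `share_equal_dictionaries` is hypothetical, and `Arc<Vec<String>>` is again only a stand-in for a dictionary's values array, not an arrow-rs type.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical pre-pass: rewrite `dicts` so that dictionaries with equal
/// contents all point at the first allocation seen with those contents.
fn share_equal_dictionaries(dicts: &mut [Arc<Vec<String>>]) {
    let mut canonical: HashMap<Vec<String>, Arc<Vec<String>>> = HashMap::new();
    for dict in dicts.iter_mut() {
        // Cloning, hashing and comparing the full contents is where the
        // extra CPU (and memory) cost of normalization comes from.
        let key: Vec<String> = (**dict).clone();
        let shared = canonical.entry(key).or_insert_with(|| Arc::clone(&*dict));
        *dict = Arc::clone(shared);
    }
}
```

After this pass, `Arc::ptr_eq` holds between any two entries whose contents were equal. The sketch only unifies the dictionaries themselves; no keys need remapping here because equal contents imply equal indices.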

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
