alamb commented on issue #3389: URL: https://github.com/apache/arrow-rs/issues/3389#issuecomment-1368864307
Here is my high level plan.

Reference: https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc

1. Add an option to `FlightIpcEncoder` to "Assign dictionary ids based on pointers", defaults to true
2. If this option is active, the encoder will keep a mapping from the actual pointer of the dictionary to ids
3. The encoder will check any new dictionary arrays encountered for an existing entry in the map
4. If the entry already exists, the dictionary batch that is transmitted will use the pre-existing entry in the map
5. If the entry does not exist, the dictionary will be transmitted and the entry sent with the batch

There is also the use case where arrays have the same logical dictionary but the contents are in different actual arrays. While I thought about adding a feature to directly "normalize" these dictionaries by comparing their values, I would like to avoid this for the first version if possible because:

1. It will be non-trivially expensive
2. It can be done as a pre-pass over the record batches for systems that want to trade off additional CPU time for less network bandwidth
3. I can imagine users might want to do other normalizing operations (like combine / coalesce dictionaries) prior to transmission, but that would be more system specific
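To make the pointer-based id assignment in the plan concrete, here is a rough sketch in plain Rust. It is not the arrow-rs API: `DictValues`, `DictAction`, `PointerDictTracker`, and `assign` are hypothetical names, and an `Arc<Vec<String>>` stands in for a dictionary's values array. The sketch also assumes the map keeps a clone of each `Arc` so that an address cannot be freed and reused while its id is still registered.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a dictionary's values array (hypothetical; a real encoder
/// would work with Arrow arrays rather than a Vec<String>).
type DictValues = Arc<Vec<String>>;

/// What the encoder should do for a dictionary it has just encountered.
#[derive(Debug, PartialEq)]
enum DictAction {
    /// First time this allocation is seen: transmit the dictionary under this id.
    Send(i64),
    /// This allocation was already transmitted: reuse the existing id.
    Reuse(i64),
}

/// Maps the address of a dictionary allocation to the id assigned to it.
#[derive(Default)]
struct PointerDictTracker {
    // The Arc clone is stored next to the id so the allocation stays alive
    // and its address cannot be recycled for a different dictionary.
    ids: HashMap<usize, (i64, DictValues)>,
    next_id: i64,
}

impl PointerDictTracker {
    /// Look up `values` by pointer identity only; contents are never compared.
    fn assign(&mut self, values: &DictValues) -> DictAction {
        let key = Arc::as_ptr(values) as usize;
        if let Some((id, _)) = self.ids.get(&key) {
            return DictAction::Reuse(*id);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(key, (id, Arc::clone(values)));
        DictAction::Send(id)
    }
}

/// Small usage example: two handles to one allocation, plus a separate
/// allocation holding equal contents.
fn demo() -> Vec<DictAction> {
    let shared: DictValues = Arc::new(vec!["a".to_string(), "b".to_string()]);
    let same_allocation = Arc::clone(&shared);
    let equal_contents: DictValues = Arc::new(vec!["a".to_string(), "b".to_string()]);

    let mut tracker = PointerDictTracker::default();
    vec![
        tracker.assign(&shared),
        tracker.assign(&same_allocation),
        tracker.assign(&equal_contents),
    ]
}
```

In this sketch the third call gets a fresh id even though the contents are equal, which is the "same logical dictionary in different actual arrays" case that the plan leaves to an optional pre-pass.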
