alamb commented on issue #3389:
URL: https://github.com/apache/arrow-rs/issues/3389#issuecomment-1368864307

   Here is my high level plan.
   
   Reference:  
https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
   
   1. Add an option to `FlightIpcEncoder` to "Assign dictionary ids based on 
pointers", defaulting to true
   2. If this option is active, the encoder will keep a mapping from the actual 
pointer of each dictionary to its id
   3. The encoder will check any new dictionary arrays encountered for an 
existing entry in the map
   4. If the entry already exists, the dictionary batch that is transmitted 
will use the pre-existing entry in the map
   5. If the entry does not exist, the dictionary will be transmitted and the 
entry sent with the batch
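
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the planned implementation: `PointerDictionaryTracker` and `DictValues` are hypothetical names, and a plain `Arc<Vec<String>>` stands in for the dictionary values array so that the example needs nothing beyond the standard library.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a dictionary's values array (assumption: a real encoder
/// would work with Arrow arrays rather than a `Vec<String>`).
type DictValues = Arc<Vec<String>>;

/// Hypothetical tracker that assigns dictionary ids based on the pointer
/// of the underlying allocation. Not part of arrow-rs.
#[derive(Default)]
struct PointerDictionaryTracker {
    /// Maps the address of a dictionary's allocation to its assigned id.
    ids: HashMap<usize, i64>,
    /// Keeps each tracked dictionary alive so its address cannot be
    /// reused by a different allocation while it is still in the map.
    retained: Vec<DictValues>,
    next_id: i64,
}

impl PointerDictionaryTracker {
    /// Returns `(id, must_send)`. `must_send` is true only the first time
    /// a given allocation is seen, i.e. when the dictionary itself still
    /// has to be transmitted.
    fn assign(&mut self, values: &DictValues) -> (i64, bool) {
        let key = Arc::as_ptr(values) as usize;
        if let Some(&id) = self.ids.get(&key) {
            // Same pointer as before: reuse the pre-existing id.
            return (id, false);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(key, id);
        self.retained.push(Arc::clone(values));
        (id, true)
    }
}
```

Note that two dictionaries with equal contents but separate allocations get different ids here, which is exactly the case discussed next.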
   
   
   There is also the use case where there are arrays that have the same logical 
dictionary but the contents are in different actual arrays. While I thought 
about adding a feature to directly "normalize" these dictionaries by comparing 
their values, I would like to avoid this for the first version if possible 
because:
   
   1. It will be non-trivially expensive
   2. It can be done as a pre-pass over the record batches for systems that 
want to trade off additional CPU time for less network bandwidth
   3. I can imagine users might want to do other normalizing operations (like 
combine / coalesce dictionaries) prior to transmission, but that would be more 
system specific
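
For illustration only, a pre-pass of the kind mentioned in point 2 might look like the sketch below. It is an assumption about how a caller could do this, not something proposed for the encoder: `canonicalize_dictionaries` is a hypothetical helper, and `Arc<Vec<String>>` again stands in for a dictionary values array. After the pass, logically equal dictionaries share one allocation, so pointer-based id assignment would treat them as the same dictionary. A real pre-pass would also have to confirm that the keys of each array remain valid against the shared dictionary, which holds here only because the contents are identical element by element.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical pre-pass: replace every dictionary with the first one seen
/// that has identical contents. This costs a full comparison (hash) of the
/// values, which is the extra CPU time being traded for bandwidth.
fn canonicalize_dictionaries(dicts: &mut [Arc<Vec<String>>]) {
    // Maps dictionary contents to the canonical allocation holding them.
    let mut seen: HashMap<Vec<String>, Arc<Vec<String>>> = HashMap::new();
    for d in dicts.iter_mut() {
        let key = d.to_vec();
        match seen.get(&key) {
            // Equal contents already seen: point at the canonical copy.
            Some(canonical) => *d = Arc::clone(canonical),
            // First occurrence: this allocation becomes the canonical one.
            None => {
                seen.insert(key, Arc::clone(&*d));
            }
        }
    }
}
```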
   

