tustvold opened a new issue #1206:
URL: https://github.com/apache/arrow-rs/issues/1206


   **Which part is this question about**
   
   The `Field` data structure contains a `dict_id` member that stores an i64. 
The intention appears to be that different dictionaries have different IDs. 
Unfortunately, this appears to be a quirk of the IPC format and isn't widely 
utilised by arrow-rs. 
   
   **Describe your question**
   
   Most of arrow-rs is completely agnostic to dict_id, with compute kernels 
completely ignoring it, even those that recompute dictionaries such as concat.
   
   The only parts of the stack that appear to use the dict_ids are the IPC 
interfaces, which will error if they encounter the same dict_id multiple times. 
I think this inconsistency is a tad confusing, and that we should do one of 
the following:
   
   * Keep the current agnosticism within arrow-rs and assign IDs in the writers 
(potentially using Arc::ptr_eq on the values array)
   * Make arrow-rs respect dict_ids
   
   Of these the first would definitely be simpler to implement, but I'm not 
familiar enough with the purpose of dict_id to be certain there isn't some 
use-case this would preclude?
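
   To make the first option concrete, here is a minimal sketch of assigning 
IDs at write time by pointer identity. It uses only the standard library: 
`Values` and `DictIdAssigner` are hypothetical names standing in for a 
dictionary's values array and for whatever bookkeeping a writer would keep, 
and none of this is existing arrow-rs API.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a dictionary's values array (not an arrow-rs type).
type Values = Arc<Vec<String>>;

/// Hypothetical writer-side helper: one id per distinct values allocation.
#[derive(Default)]
struct DictIdAssigner {
    /// Every allocation seen so far, paired with the id it was given.
    /// Holding a clone keeps the allocation alive, so its address cannot
    /// be reused by a different dictionary while the assigner exists.
    seen: Vec<(Values, i64)>,
}

impl DictIdAssigner {
    /// Return the id already given to this exact allocation, or assign the next one.
    fn id_for(&mut self, values: &Values) -> i64 {
        let existing = self
            .seen
            .iter()
            .find(|(v, _)| Arc::ptr_eq(v, values))
            .map(|(_, id)| *id);
        if let Some(id) = existing {
            return id;
        }
        let id = self.seen.len() as i64;
        self.seen.push((Arc::clone(values), id));
        id
    }
}
```

   With this scheme two handles cloned from the same `Arc` share an id, while 
a separately allocated array with identical contents gets a new one. Pointer 
identity is cheap to check but conservative: equal dictionaries that live in 
different allocations would be written as distinct dictionaries.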
   
   **Additional context**
   
   As `Field` is part of the `Schema`, RecordBatches with different dict_ids 
will appear to have different schemas. This may have downstream implications 
for things like DataFusion, which has strong assumptions about schema 
consistency within a plan.
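
   The effect can be illustrated with simplified stand-ins. `MockField`, 
`MockSchema` and `schema_with_dict_id` below are hypothetical and are not the 
real arrow-rs types; the sketch only assumes, as described above, that 
equality compares every member of a field, `dict_id` included.

```rust
/// Hypothetical, simplified stand-in for arrow's `Field`; not the real type.
#[derive(Debug, Clone, PartialEq)]
struct MockField {
    name: String,
    nullable: bool,
    dict_id: i64,
}

/// Hypothetical stand-in for `Schema`: equality compares every field,
/// and therefore every dict_id.
#[derive(Debug, Clone, PartialEq)]
struct MockSchema {
    fields: Vec<MockField>,
}

/// Build a one-column schema whose only variable part is the dict_id.
fn schema_with_dict_id(dict_id: i64) -> MockSchema {
    MockSchema {
        fields: vec![MockField {
            name: "tag".to_string(),
            nullable: true,
            dict_id,
        }],
    }
}
```

   Two schemas built this way compare equal only when their dict_ids match, 
even though the column name and nullability are identical.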
   
   This cropped up in https://github.com/apache/arrow-datafusion/pull/1596 as 
it is using the arrow IPC format to spill buffers to disk.
   

