alamb opened a new issue #506:
URL: https://github.com/apache/arrow-rs/issues/506


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Our project (IOx) uses dictionary arrays heavily to compress arrays of 
strings (that are mostly the same value). When `DictionaryArrays` are 
concatenated together the contents of the dictionaries are also concatenated 
resulting in duplication and some dictionary entries which are unused.
   
   Similarly, when filtering a DictionaryArray some values may be filtered 
entirely and thus the dictionary may include values that are repeated.
   
   **Describe the solution you'd like**
   1. Add an `optimize_dictionaries` function which would return a potentially 
new array such that its dictionary had the following properties:
   a) Every value in the dictionary is unique
   b) Every value in the dictionary has at least one use in the array' values
   
   So given an input like
   
   Array:  "A, C, A, B, C, C, A"
   Dictionary: `{A, B, C, D, E, F, A, B, C}`
   Values: `{0, 2, 0, 1, 2, 2, 0}`
   
   The output of `optimize_dictionaroes` would be (note no duplication in 
dictionary)
   Array:  "A, C, A, B, C, C, A"
   Dictionary: `{A, B, C}`
   Values: `{0, 2, 0, 1, 2, 2, 0}`
   
   
   2. Add a function such as `concat_and_optimize_batches` that would do the 
array compaction as part of `concatenation` (and thus avoid creating an 
intermediate `concat` result which was then optimized)
   
   The solution described in https://github.com/apache/arrow-rs/issues/504 is 
designed to avoid copying dictionaries when they are identical (aka literally 
point at the same array)
   
   **Describe alternatives you've considered**
   An alternative might be to extend `concat_batches` directly to always try 
and compact arrays. However, since this involves a tradeoff of more CPU for 
potentially less memory usage, I am not sure it is appropriate to always apply.
   
   **Additional context**
   
   @tustvold has implemented a version of this code in IOx: 
https://github.com/influxdata/influxdb_iox/pull/1830 which could be suitable 
for inclusion in arrow
   
   See also https://github.com/apache/arrow-rs/issues/504,
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to