alamb opened a new issue #504:
URL: https://github.com/apache/arrow-rs/issues/504


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   For dictionary arrays, the code for the `concat` simply copies the data from 
multiple places into a single contigous block: 
https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/concat.rs#L55
   
   For dictionary arrays, this results in potentially duplicating the 
dictionary values. For example:
   
   Array1:  "A, C, A, B"
   Dictionary: `{A, B, C}`
   Values: `{0, 2, 0, 1}`
   
   Array2:  "C, C, A"
   Dictionary: `{A, B, C}`
   Values: `{2, 2, 0}`
   
   Note the dictionary is the same  between Array 1 and Array 2. Concatenating 
them together results in
   Array3:  "A, C, A, B, C, C, A"
   Dictionary: `{A, B, C, A, B, C}`
   Values: `{0, 2, 0, 1, 6, 6, 4}`
   
   In IOx it turns out that we concatenate rather large numbers of such arrays 
leading to *very* large duplicated dictionaries (read all about the gory 
details here): https://github.com/influxdata/influxdb_iox/issues/1822
   
   **Describe the solution you'd like**
   When the dictionary has the exact same contents (e.g. same pointer) there is 
no need to copy the dictionary
   
   
   Here is a reproducer showing the actual issue:
   ```rust
       fn test_string_dictionary_repeated_dictionary() {
           let array : DictionaryArray<Int8Type> = vec!["a", "a", "b", 
"c"].into_iter().collect();
           let array_copy: DictionaryArray<Int8Type> = 
array.data().clone().into();
   
           // dictionary is "a", "b", "c"
           assert_eq!(array.values(), &(Arc::new(StringArray::from(vec!["a", 
"b", "c"])) as ArrayRef));
           assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2]));
   
           // concatenate it with itself
           let combined = concat(&[&array_copy as _, &array as _])
               .unwrap();
           let combined = combined
               .as_any()
               .downcast_ref::<DictionaryArray<Int8Type>>()
               .unwrap();
   
           // Since dictionares are the same, no need a copy -- it could be the 
same "a", "b", "c"
           assert_eq!(combined.values(), &(Arc::new(StringArray::from(vec!["a", 
"b", "c"])) as ArrayRef),
                      "Actual: {:#?}", combined);
           assert_eq!(combined.keys(), &Int8Array::from(vec![0, 0, 1, 2, 0, 0, 
1, 2]));
       }
   ```
   
   The actual output shows that the resulting dictionary is `"a", "b", "c", 
"a", "b", "c"` when it was literally the same pointer
   
   ```
   ---- 
compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary 
stdout ----
   thread 
'compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary' 
panicked at 'assertion failed: `(left == right)`
     left: `StringArray
   [
     "a",
     "b",
     "c",
     "a",
     "b",
     "c",
   ]`,
    right: `StringArray
   [
     "a",
     "b",
     "c",
   ]`: Actual: DictionaryArray {keys: PrimitiveArray<Int8>
   [
     0,
     0,
     1,
     2,
     3,
     3,
     4,
     5,
   ] values: StringArray
   [
     "a",
     "b",
     "c",
     "a",
     "b",
     "c",
   ]}
   ', arrow/src/compute/kernels/concat.rs:528:9
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   
   ```
   
   **Describe alternatives you've considered**
   There are times when the dictionary may not be the exact same pointer, but 
contain mostly/all the same values which would also be useful to compact, but I 
will file that under a separate ticket as it will likely involve a tradeoff 
between CPU and memory
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to