alamb opened a new issue #504: URL: https://github.com/apache/arrow-rs/issues/504
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** For dictionary arrays, the code for the `concat` simply copies the data from multiple places into a single contigous block: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/concat.rs#L55 For dictionary arrays, this results in potentially duplicating the dictionary values. For example: Array1: "A, C, A, B" Dictionary: `{A, B, C}` Values: `{0, 2, 0, 1}` Array2: "C, C, A" Dictionary: `{A, B, C}` Values: `{2, 2, 0}` Note the dictionary is the same between Array 1 and Array 2. Concatenating them together results in Array3: "A, C, A, B, C, C, A" Dictionary: `{A, B, C, A, B, C}` Values: `{0, 2, 0, 1, 6, 6, 4}` In IOx it turns out that we concatenate rather large numbers of such arrays leading to *very* large duplicated dictionaries (read all about the gory details here): https://github.com/influxdata/influxdb_iox/issues/1822 **Describe the solution you'd like** When the dictionary has the exact same contents (e.g. same pointer) there is no need to copy the dictionary Here is a reproducer showing the actual issue: ```rust fn test_string_dictionary_repeated_dictionary() { let array : DictionaryArray<Int8Type> = vec!["a", "a", "b", "c"].into_iter().collect(); let array_copy: DictionaryArray<Int8Type> = array.data().clone().into(); // dictionary is "a", "b", "c" assert_eq!(array.values(), &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef)); assert_eq!(array.keys(), &Int8Array::from(vec![0, 0, 1, 2])); // concatenate it with itself let combined = concat(&[&array_copy as _, &array as _]) .unwrap(); let combined = combined .as_any() .downcast_ref::<DictionaryArray<Int8Type>>() .unwrap(); // Since dictionares are the same, no need a copy -- it could be the same "a", "b", "c" assert_eq!(combined.values(), &(Arc::new(StringArray::from(vec!["a", "b", "c"])) as ArrayRef), "Actual: {:#?}", combined); assert_eq!(combined.keys(), &Int8Array::from(vec![0, 0, 1, 2, 0, 0, 1, 2])); } ``` The actual output shows that the resulting dictionary is `"a", "b", "c", "a", "b", "c"` when it was literally the same pointer ``` ---- compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary stdout ---- thread 'compute::kernels::concat::tests::test_string_dictionary_repeated_dictionary' panicked at 'assertion failed: `(left == right)` left: `StringArray [ "a", "b", "c", "a", "b", "c", ]`, right: `StringArray [ "a", "b", "c", ]`: Actual: DictionaryArray {keys: PrimitiveArray<Int8> [ 0, 0, 1, 2, 3, 3, 4, 5, ] values: StringArray [ "a", "b", "c", "a", "b", "c", ]} ', arrow/src/compute/kernels/concat.rs:528:9 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ``` **Describe alternatives you've considered** There are times when the dictionary may not be the exact same pointer, but contain mostly/all the same values which would also be useful to compact, but I will file that under a separate ticket as it will likely involve a tradeoff between CPU and memory -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
