alamb commented on issue #5910: URL: https://github.com/apache/arrow-rs/issues/5910#issuecomment-2175848147
BTW, the `StringDictionaryBuilder` structure (https://docs.rs/arrow/latest/arrow/array/type.StringDictionaryBuilder.html) has [code](https://docs.rs/arrow-array/52.0.0/src/arrow_array/builder/generic_bytes_dictionary_builder.rs.html#39-45) to do the deduplication quickly.

So one way to implement a combination of gc and deduplication would be to create a `DictionaryArray` with a `GenericByteDictionaryBuilder` and then cast it back to a `StringViewArray`.

With the code for the fast `DictionaryArray` --> `StringViewArray` cast added in https://github.com/apache/arrow-rs/issues/5861, this would only copy the strings once (though it would build up intermediate indexes that could perhaps be avoided with a direct approach).
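To make the trade-off concrete, here is a minimal, std-only sketch of the two-step idea. It is a conceptual model, not arrow-rs code: the `View` struct and the `dedup_to_views` and `resolve` functions are hypothetical names invented for illustration, and a real `StringViewArray` view has a different layout. Step 1 plays the role of the dictionary builder (each distinct string is copied once, plus one intermediate key per row), and step 2 plays the role of the dictionary-to-view cast.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified "view": a slice of the shared buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct View {
    offset: usize,
    len: usize,
}

/// Deduplicate `input` into one compact buffer and return one view per row.
fn dedup_to_views(input: &[Option<&str>]) -> (String, Vec<Option<View>>) {
    let mut buffer = String::new(); // every distinct string is copied here once
    let mut dict: Vec<View> = Vec::new(); // dictionary values, as views into `buffer`
    let mut seen: HashMap<&str, usize> = HashMap::new(); // string -> dictionary key
    let mut keys: Vec<Option<usize>> = Vec::with_capacity(input.len());

    // Step 1: dictionary-encode. `keys` is the intermediate index
    // that a direct approach might be able to avoid.
    for &value in input {
        let key = match value {
            None => None,
            Some(s) => {
                let next = dict.len();
                let k = *seen.entry(s).or_insert(next);
                if k == next {
                    // First occurrence: copy the bytes and record where they live.
                    dict.push(View { offset: buffer.len(), len: s.len() });
                    buffer.push_str(s);
                }
                Some(k)
            }
        };
        keys.push(key);
    }

    // Step 2: expand keys into views. No string bytes are copied here.
    let views = keys.iter().map(|k| k.map(|i| dict[i])).collect();
    (buffer, views)
}

/// Look up the string a view points at.
fn resolve(buffer: &str, view: View) -> &str {
    &buffer[view.offset..view.offset + view.len]
}
```

For example, the input `[Some("foo"), Some("bar"), None, Some("foo")]` yields the buffer `"foobar"`, and rows 0 and 3 share the same view. The `keys` vector is the extra allocation the comment refers to: it exists only to be turned into views in step 2.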
