alamb commented on issue #5910:
URL: https://github.com/apache/arrow-rs/issues/5910#issuecomment-2175848147

   BTW the 
[`StringDictionaryBuilder`](https://docs.rs/arrow/latest/arrow/array/type.StringDictionaryBuilder.html) 
structure has 
[code](https://docs.rs/arrow-array/52.0.0/src/arrow_array/builder/generic_bytes_dictionary_builder.rs.html#39-45) 
to do the deduplication quickly.
   
   So one way to implement a combination of gc and deduplication would be to 
create a `DictionaryArray` with a `GenericByteDictionaryBuilder` and then cast 
it back to `StringViewArray`.
   
   With the code for fast `DictionaryArray` --> `StringViewArray` added in 
https://github.com/apache/arrow-rs/issues/5861, this would only copy the 
strings once (though it would build up intermediate indexes that could perhaps 
be avoided with a direct approach).
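
   To make the idea concrete, here is a minimal std-only sketch of the two 
steps. It does not use the arrow-rs API: `Dict`, `dictionary_encode`, 
`to_views`, and `view_str` are hypothetical names, and the `(offset, len)` 
tuples only stand in for real string views. It is meant to show why each 
distinct string is copied once while the per-row keys are the intermediate 
indexes mentioned above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a dictionary-encoded string column.
/// Each distinct string is stored once in `buffer`; `keys` holds one
/// dictionary index per input row.
struct Dict {
    buffer: String,
    offsets: Vec<usize>, // value i occupies buffer[offsets[i]..offsets[i + 1]]
    keys: Vec<usize>,
}

/// Deduplicate `rows` with a hash lookup, copying each distinct string once.
fn dictionary_encode(rows: &[&str]) -> Dict {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut buffer = String::new();
    let mut offsets = vec![0];
    let mut keys = Vec::with_capacity(rows.len());
    for &row in rows {
        // Key that a previously unseen value would receive.
        let next_key = offsets.len() - 1;
        let key = *seen.entry(row).or_insert_with(|| {
            buffer.push_str(row);
            offsets.push(buffer.len());
            next_key
        });
        keys.push(key);
    }
    Dict { buffer, offsets, keys }
}

/// Resolve the keys into per-row (offset, len) views over the shared buffer.
/// No string bytes are copied here; only the intermediate `keys` are read.
fn to_views(dict: &Dict) -> Vec<(usize, usize)> {
    dict.keys
        .iter()
        .map(|&k| (dict.offsets[k], dict.offsets[k + 1] - dict.offsets[k]))
        .collect()
}

/// Read a row back through its view.
fn view_str(dict: &Dict, view: (usize, usize)) -> &str {
    &dict.buffer[view.0..view.0 + view.1]
}
```

   Calling `dictionary_encode` on the rows and then `to_views` on the result 
gives one view per row, with repeated rows sharing the same bytes. A direct 
approach would presumably produce the views inside the first loop and skip 
`keys` entirely.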


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
