[
https://issues.apache.org/jira/browse/ARROW-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neville Dipale resolved ARROW-8791.
-----------------------------------
Fix Version/s: 1.0.0
Resolution: Fixed
Issue resolved by pull request 7226
[https://github.com/apache/arrow/pull/7226]
> [Rust] Creating StringDictionaryBuilder with existing dictionary values
> -----------------------------------------------------------------------
>
> Key: ARROW-8791
> URL: https://issues.apache.org/jira/browse/ARROW-8791
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Jörn Horstmann
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> It might be useful to create a DictionaryArray that uses the same dictionary
> (the same values mapped to the same keys) as another array. One use case would
> be more efficient comparison between arrays that are known to share a
> dictionary. Another could be more efficient grouping operations across
> multiple chunks (i.e. a `Vec<DictionaryArray>`).
>
> A possible implementation could look like this:
>
> {code:java}
> impl<K> StringDictionaryBuilder<K>
> where
>     K: ArrowDictionaryKeyType,
> {
>     pub fn new_with_dictionary(
>         keys_builder: PrimitiveBuilder<K>,
>         dictionary_values: &StringArray,
>     ) -> Result<Self> {
>         let mut values_builder = StringBuilder::with_capacity(
>             dictionary_values.len(),
>             dictionary_values.value_data().len(),
>         );
>         let mut map: HashMap<Box<[u8]>, K::Native> = HashMap::new();
>         // Re-register every existing value so that value i keeps key i.
>         for i in 0..dictionary_values.len() {
>             if dictionary_values.is_valid(i) {
>                 let value = dictionary_values.value(i);
>                 map.insert(
>                     value.as_bytes().into(),
>                     K::Native::from_usize(i)
>                         .ok_or(ArrowError::DictionaryKeyOverflowError)?,
>                 );
>                 // Propagate builder errors instead of silently dropping them.
>                 values_builder.append_value(value)?;
>             } else {
>                 values_builder.append_null()?;
>             }
>         }
>         Ok(Self {
>             keys_builder,
>             values_builder,
>             map,
>         })
>     }
> }
> {code}
> I don't really like that the map has to be reconstructed here. There may be
> a more efficient way, such as passing in the HashMap directly, but it is
> probably not a good idea to expose the `Box<[u8]>` encoding of its keys.
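
The idea in the quoted description can be tried out without the arrow crate. The following is only a minimal sketch using the standard library: `SketchDictBuilder`, `append`, `keys`, and `keys_equal` are hypothetical names made up for illustration, plain `Vec`s stand in for the Arrow builders, and null dictionary entries are left out for brevity. It shows the lookup map being rebuilt from existing dictionary values so that value `i` keeps key `i`, with an overflow check on the key type.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Hypothetical stand-in for a string dictionary builder (not the arrow type).
pub struct SketchDictBuilder<K> {
    keys: Vec<K>,
    values: Vec<String>,
    map: HashMap<Box<[u8]>, K>,
}

impl<K: Copy + TryFrom<usize>> SketchDictBuilder<K> {
    /// Seed the builder with existing dictionary values; value `i` gets key `i`.
    pub fn new_with_dictionary(dictionary_values: &[&str]) -> Result<Self, String> {
        let mut values = Vec::with_capacity(dictionary_values.len());
        let mut map: HashMap<Box<[u8]>, K> =
            HashMap::with_capacity(dictionary_values.len());
        for (i, value) in dictionary_values.iter().enumerate() {
            // Fails when the index does not fit into the key type.
            let key = K::try_from(i)
                .map_err(|_| format!("dictionary key overflow at index {}", i))?;
            map.insert(value.as_bytes().into(), key);
            values.push(value.to_string());
        }
        Ok(Self { keys: Vec::new(), values, map })
    }

    /// Append a value, reusing its key if it is already in the dictionary.
    pub fn append(&mut self, value: &str) -> Result<K, String> {
        if let Some(&key) = self.map.get(value.as_bytes()) {
            self.keys.push(key);
            return Ok(key);
        }
        // Unknown value: it receives the next free key.
        let key = K::try_from(self.values.len())
            .map_err(|_| "dictionary key overflow".to_string())?;
        self.map.insert(value.as_bytes().into(), key);
        self.values.push(value.to_string());
        self.keys.push(key);
        Ok(key)
    }

    /// Keys appended so far.
    pub fn keys(&self) -> &[K] {
        &self.keys
    }
}

/// Element-wise equality on keys only; this is meaningful only when both
/// key slices were produced against the same dictionary.
pub fn keys_equal<K: PartialEq>(left: &[K], right: &[K]) -> Vec<bool> {
    left.iter().zip(right.iter()).map(|(l, r)| l == r).collect()
}
```

Two builders seeded from the same slice, for example `["a", "b", "c"]`, assign identical keys to those values, so `keys_equal` can compare the arrays without touching the strings. Values appended later that were not in the shared dictionary may receive different keys in each builder, so a key-only comparison is safe only for values covered by the shared dictionary.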
--
This message was sent by Atlassian Jira
(v8.3.4#803005)