[ 
https://issues.apache.org/jira/browse/ARROW-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-8791:
-------------------------------------

    Assignee: Jörn Horstmann

> [Rust] Creating StringDictionaryBuilder with existing dictionary values
> -----------------------------------------------------------------------
>
>                 Key: ARROW-8791
>                 URL: https://issues.apache.org/jira/browse/ARROW-8791
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Jörn Horstmann
>            Assignee: Jörn Horstmann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> It might be useful to create a DictionaryArray that uses the same dictionary 
> values as another array. One use case would be more efficient comparison between 
> arrays if it is known that they use the same dictionary. Another could be 
> more efficient grouping operations across multiple chunks (i.e. a 
> `Vec<DictionaryArray>`).
>  
> A possible implementation could look like this:
>  
> {code:java}
> impl<K> StringDictionaryBuilder<K>
> where
>     K: ArrowDictionaryKeyType,
> {
>     pub fn new_with_dictionary(
>         keys_builder: PrimitiveBuilder<K>,
>         dictionary_values: &StringArray,
>     ) -> Result<Self> {
>         let mut values_builder = StringBuilder::with_capacity(
>             dictionary_values.len(),
>             dictionary_values.value_data().len(),
>         );
>         let mut map: HashMap<Box<[u8]>, K::Native> = HashMap::new();
>         for i in 0..dictionary_values.len() {
>             if dictionary_values.is_valid(i) {
>                 let value = dictionary_values.value(i);
>                 map.insert(
>                     value.as_bytes().into(),
>                     K::Native::from_usize(i)
>                         .ok_or(ArrowError::DictionaryKeyOverflowError)?,
>                 );
>                 values_builder.append_value(value)?;
>             } else {
>                 values_builder.append_null()?;
>             }
>         }
>         Ok(Self {
>             keys_builder,
>             values_builder,
>             map,
>         })
>     }
> }{code}
> I don't really like that the map has to be reconstructed here. Maybe there is 
> a more efficient way by passing in the HashMap directly, but it's probably 
> not a good idea to expose the `Box<[u8]>` encoding of its keys.
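
To make the motivation concrete, here is a minimal, std-only sketch of the idea, not the arrow implementation. The names `SeededDictBuilder`, `with_dictionary`, `append` and `finish` are hypothetical, and the `u8` key type and `String` errors are assumptions chosen to keep the example short. It seeds a builder from existing dictionary values so that two builders seeded from the same values assign the same key to the same string.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Hypothetical stand-in for a string dictionary builder (not the arrow API).
struct SeededDictBuilder {
    values: Vec<String>,          // dictionary values, indexed by key
    lookup: HashMap<String, u8>,  // value -> key, rebuilt from the seed
    keys: Vec<u8>,                // one key per appended element
}

impl SeededDictBuilder {
    /// Seeds the builder so that each existing value keeps its position as key.
    fn with_dictionary(existing: &[&str]) -> Result<Self, String> {
        let mut values = Vec::with_capacity(existing.len());
        let mut lookup = HashMap::with_capacity(existing.len());
        for (i, v) in existing.iter().enumerate() {
            // Assumed error handling: fail if the index does not fit the key type.
            let key = u8::try_from(i).map_err(|_| "dictionary key overflow".to_string())?;
            lookup.insert(v.to_string(), key);
            values.push(v.to_string());
        }
        Ok(Self { values, lookup, keys: Vec::new() })
    }

    /// Appends a value, reusing its seeded key or allocating the next free key.
    fn append(&mut self, value: &str) -> Result<u8, String> {
        if let Some(&key) = self.lookup.get(value) {
            self.keys.push(key);
            return Ok(key);
        }
        let key = u8::try_from(self.values.len())
            .map_err(|_| "dictionary key overflow".to_string())?;
        self.values.push(value.to_string());
        self.lookup.insert(value.to_string(), key);
        self.keys.push(key);
        Ok(key)
    }

    /// Returns the appended keys and the final dictionary values.
    fn finish(self) -> (Vec<u8>, Vec<String>) {
        (self.keys, self.values)
    }
}
```

With two builders seeded from the same values, elements that appear in the seed can be compared or grouped by key alone. Values appended beyond the seed receive builder-local keys, so that shortcut only holds for keys below the seed length.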



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
