tustvold opened a new issue #1218:
URL: https://github.com/apache/arrow-rs/issues/1218


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Currently, when casting an array to a DictionaryArray, the cast kernel 
computes a new dictionary for the type. This dictionary contains unique values 
but is not sorted.
   
   However, in some cases uniqueness and/or sortedness may not be a priority, 
e.g. because a subsequent operation is going to filter out a large number of 
potential matches, and computing this dictionary is therefore wasted effort.
   
   **Describe the solution you'd like**
   
   Add two new CastOptions:
   
   * `sort_dictionary` - if the result is a dictionary array, the dictionary 
will be sorted
   * `pack_dictionary` - if the result is a dictionary array, the dictionary 
will be unique
   
   When neither option is set, this gives the cast kernel the leeway to 
construct a DictionaryArray by taking the provided array as the dictionary child 
data (values) and encoding `0..array.len()` in the keys array. The kernel will 
of course need to fall back to computing a packed dictionary if the key type is 
too small to accommodate this.
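   
   A minimal sketch of the intended behaviour, written against plain `Vec`s 
rather than the arrow-rs array types. `DictCastOptions` and `cast_to_dictionary` 
are hypothetical names used only for illustration, the key type is fixed to `u8`, 
and nulls are ignored. For simplicity the sketch also packs whenever sorting is 
requested, which is an assumption rather than something the proposal requires.
   
```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the two proposed options (not arrow-rs API).
#[derive(Clone, Copy, Default)]
struct DictCastOptions {
    sort_dictionary: bool,
    pack_dictionary: bool,
}

/// Returns `(keys, dictionary_values)` for a `u8`-keyed dictionary.
fn cast_to_dictionary(
    values: &[&str],
    opts: DictCastOptions,
) -> Result<(Vec<u8>, Vec<String>), String> {
    // A u8 key can address at most 256 dictionary entries.
    let max_entries = u8::MAX as usize + 1;

    // Fast path: reuse the input as the dictionary and emit keys 0..len.
    if !opts.sort_dictionary && !opts.pack_dictionary && values.len() <= max_entries {
        let keys = (0..values.len()).map(|i| i as u8).collect();
        let dict = values.iter().map(|v| v.to_string()).collect();
        return Ok((keys, dict));
    }

    // Otherwise compute a packed (unique) dictionary in first-seen order.
    // This is also the fallback when the input is too long for the key type.
    let mut dict: Vec<&str> = Vec::new();
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut indices: Vec<usize> = Vec::with_capacity(values.len());
    for &v in values {
        let idx = *seen.entry(v).or_insert_with(|| {
            dict.push(v);
            dict.len() - 1
        });
        indices.push(idx);
    }
    if dict.len() > max_entries {
        return Err(format!("{} distinct values do not fit in u8 keys", dict.len()));
    }

    if opts.sort_dictionary {
        // Sort the dictionary entries and remap the keys to the new positions.
        let mut order: Vec<usize> = (0..dict.len()).collect();
        order.sort_by_key(|&i| dict[i]);
        let mut remap = vec![0usize; dict.len()];
        for (new_idx, &old_idx) in order.iter().enumerate() {
            remap[old_idx] = new_idx;
        }
        let sorted: Vec<&str> = order.iter().map(|&i| dict[i]).collect();
        dict = sorted;
        for idx in indices.iter_mut() {
            *idx = remap[*idx];
        }
    }

    let keys = indices.into_iter().map(|i| i as u8).collect();
    Ok((keys, dict.into_iter().map(String::from).collect()))
}
```
   
   With default options, the input `["b", "a", "b", "c"]` is reused as the 
dictionary unchanged, duplicates included, and the keys are simply 
`[0, 1, 2, 3]`. The hashing work only happens when packing or sorting is 
requested, or when the input is too long for the key type.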
   
   This will also provide an obvious way to implement (#506) as an array could 
be cast to itself with options to sort and/or pack the dictionary. This could 
be further combined with #1217 to avoid doing this computation if not necessary.
   
   **Additional Context**
   
   The concat kernel currently takes a similar approach of avoiding recomputing 
dictionaries.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

