[GitHub] [arrow-rs] jhorstmann commented on issue #506: "Optimize" Dictionary contents in DictionaryArray / `concat_batches`

GitBox Mon, 28 Jun 2021 13:26:35 -0700


jhorstmann commented on issue #506:
URL: https://github.com/apache/arrow-rs/issues/506#issuecomment-870014675



   >     1. b) Every value in the dictionary has at least one use in the array's values
   
   A nice benefit of this is that a subsequent GROUP BY on that dictionary column 
would be very cheap: it would not need another hashmap and could instead 
index directly into an array of accumulators using the keys. I'm not sure whether 
that is the use case you are after or whether this is more of a nice side effect.
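
To illustrate the GROUP BY point, here is a minimal sketch in plain Rust, not using the arrow crate. The function name `sum_by_dictionary_key` and the use of `usize` keys and `i64` values are assumptions made for the example, and nulls are ignored. It relies on every key being a valid index into the dictionary, so the keys can address a vector of accumulators directly.

```rust
/// Sums `values` per dictionary key without a hashmap.
/// Hypothetical helper: assumes every key is smaller than `dict_len`
/// and that there are no null keys.
fn sum_by_dictionary_key(keys: &[usize], values: &[i64], dict_len: usize) -> Vec<i64> {
    assert_eq!(keys.len(), values.len());
    // One accumulator slot per dictionary entry.
    let mut accumulators = vec![0i64; dict_len];
    for (&key, &value) in keys.iter().zip(values) {
        // The dictionary key is used directly as the group index.
        accumulators[key] += value;
    }
    accumulators
}
```

If the dictionary contained unused values, their slots would stay at zero and would have to be filtered out afterwards, which is where the "at least one use" guarantee helps.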
   
   Ensuring sorted dictionaries is something I'm definitely interested in. 
`Field` already has the `dict_is_ordered` flag, based on which a much faster 
implementation of a sort comparator or comparison kernel could be selected. I was 
thinking of a different implementation than one using a BTreeSet, though. I have 
only a rough sketch, but the idea is to use `sort_to_indices` on the dictionary 
values and then somehow build a lookup table as a vector. With the sorted 
indices it should also be possible to build a lookup table for remapping 
duplicates.
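
A rough illustration of that idea in plain Rust, with a sorted index vector standing in for the result of `sort_to_indices`. The function names, the `&str` values and the `usize` keys are assumptions made for the example, nulls are ignored, and this is not meant as the actual kernel implementation.

```rust
/// Returns the sorted, deduplicated dictionary and a lookup table that maps
/// each old key to its new key. Hypothetical helper, no null handling.
fn sort_and_dedup_dictionary(values: &[&str]) -> (Vec<String>, Vec<usize>) {
    // Stand-in for `sort_to_indices`: positions of `values` in sorted order.
    let mut sorted_indices: Vec<usize> = (0..values.len()).collect();
    sorted_indices.sort_by_key(|&i| values[i]);

    let mut new_values: Vec<String> = Vec::new();
    let mut lookup = vec![0usize; values.len()];
    for &old_key in &sorted_indices {
        // Equal values are adjacent in sorted order, so comparing against
        // the last emitted value is enough to detect a duplicate.
        if new_values.last().map(String::as_str) != Some(values[old_key]) {
            new_values.push(values[old_key].to_string());
        }
        lookup[old_key] = new_values.len() - 1;
    }
    (new_values, lookup)
}

/// Rewrites the array's keys through the lookup table.
fn remap_keys(keys: &[usize], lookup: &[usize]) -> Vec<usize> {
    keys.iter().map(|&key| lookup[key]).collect()
}
```

For the dictionary `["b", "a", "c", "a"]` this produces the new dictionary `["a", "b", "c"]` and the lookup table `[1, 0, 2, 0]`, so both occurrences of `"a"` collapse onto key 0.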
   


