[GitHub] [arrow-datafusion] Dandandan commented on issue #5472: Speed up `DistinctCountAccumulator`

via GitHub Sun, 23 Jul 2023 15:02:25 -0700


Dandandan commented on issue #5472:
URL: 
https://github.com/apache/arrow-datafusion/issues/5472#issuecomment-1646970253


   I tried to prototype a solution. 
   
   I think we need more than just storing a `HashSet` as the distinct count 
needs to be distinct *per group index* not just overall number of distinct 
values (that's why the original has to use a `HashMap` per accumulator). It 
needs to keep track of the unique values per group index and produce them (as 
list) to compute the the final count per group index.
   
   I think an efficient way to store the data for primitives could be the 
following:
   
   ```
   PrimitiveDistinctCountGroupsAccumulator<T: ArrowPrimitiveType> {
       // stores unique group index + index of value in `values` for this group 
index
       uniques: RawTable<(usize, usize)>,
       /// stores the unique values
       values: Vec<T::Native>,
       /// Stores the index (+1) of the first value for index (0 is end of list)
       /// The value is stored in`values`, the next (possible) index in `next`
       /// First item for group index is stored at group index
       first: Vec<usize>,
       /// Stores the location (+1) of the next value (0 is end of list)
       next: Vec<usize>,
       /// Track nulls in the input / filters
       null_state: NullState,
       /// stores number of groups (to produce state)
       total_num_groups: usize,
       /// random state
       random_state: RandomState
   }
   ```
   
   `uniques` stores the group id + index to the value.
   We can use a similar datastructure as in the hash join to store the data in 
a chained list datastructure in a single `Vec`.
   This makes it possible to produce a list array for each group index.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] Dandandan commented on issue #5472: Speed up `DistinctCountAccumulator`

Reply via email to