[GitHub] [arrow-datafusion] Dandandan commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

GitBox Wed, 28 Jul 2021 04:40:22 -0700


Dandandan commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-888229063



   > > we can keep the keys in a contiguous (mutable) array instead and keep 
offsets/pointers to the values in this array (and null values can be stored in 
a bitmap, so only 1 bit per value).
   > 
   > @Dandandan this is a great idea - I will write up in more detail what I 
think you are proposing to make sure we are on the same page
   
   Some more details to get you started, this is the rough idea that is a 
similar to how hash join currently works (there it's a bit easier as we can 
concatenate the data upfront):
   
   * We can store both group by keys and aggregate values in typed / mutable 
Arrays. This means the memory overhead is kept to a minimum and is much more 
vectorization / cache friendly.
   E.g. for a count grouped on u64 values, the state we keep for the group by 
values is `MutableArray<U64Type>` and for the count state another 
`MutableArray<U64Type>` (no bitmap needed here). The group by values can have 
null values which are stored in a bitmap.
   
   * A hashmap that only contains hashes / pointers: `Hashmap<u64, 
SmallVec<u64>`
      - first `u64` is the hash of the group by values -> this can be 
calculated in a vectorized manner for the entire batch
      - second `u64` is the offset in both the group by value(s) and aggregate 
state arrays -> inserted to the mutable arrays when there is no match
   * hash collisions (group by values mapping to same `u64` should/can be 
handled by comparing values in the arrays at the group by offset and scanning 
the `SmallVec`
   
   Above structure will reduce the memory needed for the state, and will also 
reduce the time to create / (re)hash the keys and will reduce the amount of 
intermediate vec / small allocations that are needed in the hash aggregate code.
   
   There are maybe slightly different ways to encode the data in the hashmap / 
check collisions, and above structure makes some further optimizations / 
vectorization possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Reply via email to