[ 
https://issues.apache.org/jira/browse/ARROW-12044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393473#comment-17393473
 ] 

Michal Nowakiewicz edited comment on ARROW-12044 at 8/4/21, 10:38 PM:
----------------------------------------------------------------------

The description above suggests that this can potentially be solved at a level 
above Grouper, using a computed field representing the target partition id. A 
note regarding partitioning and group by: it is important not to use the same 
hash bits from the same hash function for both operations; otherwise the 
number of hash collisions in the hash table can grow large.

This issue points out an existing problem that needs to be solved. Currently, 
Grouper is designed to work only for up to around 10 million groups or up to 
4 GB of data, whichever limit is hit first.

There are multiple ways Grouper can be extended to work with an arbitrary 
number of groups, up to the limit of available RAM. One solution is to simply 
change the current hash table to use 64-bit offsets / hashes / group ids. 
Based on experiments I did a few months ago, this is not the most efficient 
way of doing group by aggregation for large hash tables - solutions based on 
partitioning the input data prior to grouping seem to perform better for 
>4 GB hash tables. The performance bottleneck is cache misses, including 
costly TLB misses, which happen on almost every lookup if the input keys are 
roughly uniformly distributed.

The other two approaches, which seem appealing to me, are based on the same 
idea of representing a large hash table as a collection of smaller hash 
tables. Each smaller hash table in the collection stores data for a distinct 
range of hashes, and together the ranges cover the entire set of possible 
hash values. The size of the hash range can potentially vary from one table 
to another within the collection.
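
A minimal sketch of this representation, with purely illustrative names 
rather than existing Arrow code:

{code:cpp}
// Hypothetical sketch: a large hash table represented as a collection of
// smaller tables, each owning a disjoint range of hash values; together the
// ranges cover the full 32-bit hash space and may differ in size.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct SmallHashTable { /* keys, group ids and aggregates for one range */ };

struct HashRangeTable {
  uint32_t hash_begin;  // inclusive lower bound of the covered hash range
  uint64_t range_size;  // number of hash values covered; may vary per table
  SmallHashTable table;
};

// Kept sorted by hash_begin, so the table responsible for a given hash is
// found with a binary search.
SmallHashTable* FindTable(std::vector<HashRangeTable>& tables, uint32_t hash) {
  auto it = std::upper_bound(
      tables.begin(), tables.end(), hash,
      [](uint32_t h, const HashRangeTable& t) { return h < t.hash_begin; });
  return it == tables.begin() ? nullptr : &std::prev(it)->table;
}
{code}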

One approach is to use a sequence of merges. Once a small hash table gets 
full, it is appended to the list of merge inputs, and a fresh empty hash 
table replaces it for processing subsequent input exec batches. Fragments of 
the merge input hash tables are then merged together to form a new hash table 
representing the combined results; the merged fragments may cover only a 
subset of the hashes present in the input tables. For instance, a merge of 
multiple hash tables with hashes from 0 to 1023 may produce a pair of hash 
tables representing the combined grouped aggregation results, one for hashes 
from 0 to 511 and the other for 512 to 1023.
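
A rough sketch of this flow, with the small table reduced to a map from hash 
to an aggregate so the merge step can be shown end to end (illustrative only, 
not Arrow code):

{code:cpp}
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using SmallTable = std::map<uint32_t, int64_t>;  // hash -> aggregate (count)

constexpr uint32_t kMaxGroupsPerTable = 1u << 20;  // example capacity limit

struct MergingGrouper {
  SmallTable active;                     // receives new exec batches
  std::vector<SmallTable> merge_inputs;  // full tables awaiting a merge

  void AfterBatch() {
    if (active.size() >= kMaxGroupsPerTable) {
      merge_inputs.push_back(std::move(active));  // retire the full table
      active.clear();                             // continue with a fresh one
    }
  }

  // Merge the fragments of all retired tables whose hashes fall in
  // [begin, end), e.g. the [0, 512) half or the [512, 1024) half of inputs
  // that cover hashes 0 to 1023.
  SmallTable MergeRange(uint32_t begin, uint32_t end) const {
    SmallTable out;
    for (const auto& t : merge_inputs) {
      for (auto it = t.lower_bound(begin); it != t.end() && it->first < end;
           ++it) {
        out[it->first] += it->second;  // combine aggregates with equal hashes
      }
    }
    return out;
  }
};
{code}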

The second approach is to partition the data. The input keys, either before 
or after an initial aggregation, are put into buckets corresponding to 
partitions. Each partition covers a distinct range of hash values, but the 
number of partitions and the size of each range are fixed. Grouped 
aggregation is then done separately within each partition, using a target 
hash table for that partition. Recursive partitioning or dynamic adjustment 
of the number of partitions may be used if the final size of the data cannot 
be estimated accurately.
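
A minimal sketch of the scatter step, again with illustrative names rather 
than existing Arrow code:

{code:cpp}
// Hypothetical sketch: scatter the row indices of one batch into a fixed
// number of buckets based on the top bits of the precomputed key hashes.
#include <cstdint>
#include <vector>

constexpr int kLog2NumPartitions = 6;  // 64 partitions, fixed up front

inline uint32_t PartitionOf(uint32_t hash) {
  return hash >> (32 - kLog2NumPartitions);  // top bits pick the partition
}

std::vector<std::vector<int32_t>> PartitionBatch(
    const std::vector<uint32_t>& hashes) {
  std::vector<std::vector<int32_t>> buckets(1 << kLog2NumPartitions);
  for (int32_t row = 0; row < static_cast<int32_t>(hashes.size()); ++row) {
    buckets[PartitionOf(hashes[row])].push_back(row);
  }
  return buckets;
}

// Each bucket is then consumed by the grouper dedicated to that partition; a
// partition that still grows too large can be re-partitioned recursively on
// a different set of hash bits.
{code}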

Merging and partitioning are somewhat symmetrical. It is not immediately 
clear to me that one is better than the other.

So in summary, the design choices I see are: a) whether a single hash table 
or a collection of hash tables for disjoint hash ranges should represent the 
aggregation results (my preference is currently the collection, which seems 
to give more flexibility in the future), and b) whether to use the 
merge-based or the partition-based approach (or neither).



> [C++][Compute] Add support for imperfect grouping for use in radix 
> partitioning
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-12044
>                 URL: https://issues.apache.org/jira/browse/ARROW-12044
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>              Labels: query-engine
>             Fix For: 6.0.0
>
>
> ARROW-11591 adds Grouper for identifying groups based on multiple key columns.
> For a large number of groups, it is beneficial to do a first pass 
> partitioning on the key columns so that each worker thread only handles a 
> subset of the query's groups. This is usually accomplished by computing only 
> the hashes of the keys (not full group identity) and pushing slices of the 
> input batches to workers based on those.
> This would probably make sense as a member function of Grouper, maybe 
> Grouper::ConsumeImperfect


