[
https://issues.apache.org/jira/browse/ARROW-12044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393473#comment-17393473
]
Michal Nowakiewicz edited comment on ARROW-12044 at 8/4/21, 10:38 PM:
----------------------------------------------------------------------
The description above suggests that this can potentially be solved at a level
above Grouper, using a computed field representing the target partition id. A
note regarding partitioning and group by: it is important not to use the same
hash bits, coming from the same hash function, for both operations; otherwise
the number of hash collisions in the hash table can grow large.
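To illustrate the hash bit comment, here is a minimal sketch (hypothetical names
and constants, not the actual Grouper code) of one way to follow that advice:
derive the partition id from the top bits of a 64-bit hash and leave the
remaining bits for the hash table, so the two operations never share bits.
{code:cpp}
#include <cstdint>

// Number of bits reserved for the partition id (16 partitions here); this is
// an arbitrary choice for the example, not a value used by Arrow.
constexpr int kPartitionBits = 4;

struct SplitHash {
  uint32_t partition_id;  // taken from the top kPartitionBits bits
  uint64_t table_hash;    // remaining bits, used only inside the hash table
};

inline SplitHash Split(uint64_t hash) {
  SplitHash out;
  out.partition_id = static_cast<uint32_t>(hash >> (64 - kPartitionBits));
  out.table_hash = hash & ((uint64_t{1} << (64 - kPartitionBits)) - 1);
  return out;
}
{code}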
This issue points out an existing problem that needs to be solved. Currently
Grouper is designed for, and works only with, up to around 10 million groups or
up to 4 GB of data, whichever limit is hit first.
There are multiple ways in which Grouper can be extended to work with an
arbitrary number of groups, up to the limit of available RAM. One solution is
simply to change the current hash table to use 64-bit offsets / hashes / group
ids. From the experiments I did a few months ago, this is not the most
efficient way of doing group-by aggregation for large hash tables - solutions
based on partitioning the input data prior to grouping seem to perform better
for >4 GB hash tables. The performance bottleneck here is related to cache
misses, including costly TLB misses, which happen on almost every lookup if the
input has somewhat uniformly distributed keys.
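To make that trade-off concrete, here is a hypothetical comparison of entry
layouts (illustration only, not Grouper's actual internal representation):
{code:cpp}
#include <cstdint>

// Hypothetical entry layouts, for illustration only.
struct Entry32 {
  uint32_t hash;        // 32-bit fields keep entries small, but key offsets
  uint32_t group_id;    // are limited to < 4 GB of key data and group ids to
  uint32_t key_offset;  // 32-bit values
};

struct Entry64 {
  uint64_t hash;        // widening to 64 bits lifts those limits, but doubles
  uint64_t group_id;    // the entry footprint; with uniformly distributed keys
  uint64_t key_offset;  // almost every probe into such a large table still
};                      // misses the CPU caches and the TLB
{code}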
The other two approaches, which seem appealing to me, are based on the same
idea of representing a large hash table as a collection of smaller hash tables.
Each smaller hash table in the collection stores data for a distinct range of
hashes, and together the ranges cover the entire set of possible hash values.
The size of the hash range can potentially vary from one smaller hash table to
another within a collection.
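As a rough sketch of that idea (hypothetical types, not part of Arrow), the
collection can be modelled as a list of sub-tables plus the upper bound of each
sub-table's hash range, with lookups routed by a binary search over the bounds:
{code:cpp}
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

struct SmallHashTable {
  // Placeholder; a real sub-table would hold keys, group ids and aggregate
  // states for the hashes that fall into its range.
};

struct RangePartitionedTable {
  // range_max[i] is the inclusive upper bound of hashes stored in tables[i];
  // the last entry is UINT64_MAX, so together the ranges cover every hash.
  std::vector<uint64_t> range_max;
  std::vector<std::unique_ptr<SmallHashTable>> tables;

  SmallHashTable* TableFor(uint64_t hash) {
    // First sub-table whose upper bound is >= hash.
    size_t i = std::lower_bound(range_max.begin(), range_max.end(), hash) -
               range_max.begin();
    return tables[i].get();
  }
};
{code}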
One approach is to use a sequence of merges. Once a small hash table gets full,
it is appended to the list of merge inputs and a fresh, empty hash table
replaces it for subsequent processing of input exec batches. Fragments of the
merge input hash tables then get merged together to form a new hash table that
represents the combined results. The fragments being merged may cover only a
subset of the hashes from the input hash tables: for instance, a merge of
multiple hash tables with hashes from 0 to 1023 may result in a pair of hash
tables representing the combined grouped aggregation results, one for hashes
from 0 to 511 and the other for hashes from 512 to 1023.
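A rough sketch of that flow (hypothetical interfaces, not existing Arrow code):
a full table gets parked as a merge input and replaced, and a later merge step
combines one hash sub-range from all parked tables into a single result table.
{code:cpp}
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct HashRange { uint64_t min; uint64_t max; };  // inclusive bounds

struct SmallHashTable {
  // Placeholder internals; a real table would also hold keys and aggregates.
  std::vector<uint64_t> hashes;
  static constexpr size_t kCapacity = 1 << 20;  // arbitrary size for the example

  bool Full() const { return hashes.size() >= kCapacity; }

  // Copy the entries of `other` whose hashes fall inside `range` into *this
  // (a real merge would also combine the aggregate states of equal keys).
  void MergeFragment(const SmallHashTable& other, HashRange range) {
    for (uint64_t h : other.hashes) {
      if (h >= range.min && h <= range.max) hashes.push_back(h);
    }
  }
};

struct MergingGrouper {
  std::unique_ptr<SmallHashTable> active = std::make_unique<SmallHashTable>();
  std::vector<std::unique_ptr<SmallHashTable>> merge_inputs;

  // Called after consuming an exec batch: park the active table when full.
  void MaybeRotate() {
    if (active->Full()) {
      merge_inputs.push_back(std::move(active));
      active = std::make_unique<SmallHashTable>();
    }
  }

  // Merge one hash sub-range (e.g. [0, 511] out of [0, 1023]) from all parked
  // tables into a single table holding the combined results for that range.
  std::unique_ptr<SmallHashTable> MergeRange(HashRange range) {
    auto out = std::make_unique<SmallHashTable>();
    for (const auto& in : merge_inputs) out->MergeFragment(*in, range);
    return out;
  }
};
{code}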
The second approach is to use partitioning of the data. The input keys, either
before or after initial aggregation, are put into buckets corresponding to
partitions. Each partition covers a distinct range of hash values, but the
number of partitions and the size of each range are fixed. Grouped aggregation
is then done separately within each partition, using a target hash table for
that partition. Recursive partitioning or dynamic adjustment of the number of
partitions may be used if the final size of the data cannot be estimated
correctly.
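And a minimal sketch of the partitioning flow (hypothetical helper, not
existing Arrow code): rows are first bucketed into a fixed number of hash-range
partitions, and each bucket is later aggregated with its own hash table.
{code:cpp}
#include <cstdint>
#include <vector>

// Fixed number of partitions, selected by the top bits of the key hash;
// 64 partitions (6 bits) is an arbitrary choice for the example.
constexpr int kPartitionBitCount = 6;
constexpr int kNumPartitions = 1 << kPartitionBitCount;

// Bucket row indices by the hash range their key hash falls into.
std::vector<std::vector<int64_t>> PartitionByHash(
    const std::vector<uint64_t>& hashes) {
  std::vector<std::vector<int64_t>> buckets(kNumPartitions);
  for (int64_t row = 0; row < static_cast<int64_t>(hashes.size()); ++row) {
    int p = static_cast<int>(hashes[row] >> (64 - kPartitionBitCount));
    buckets[p].push_back(row);
  }
  return buckets;
}
// Each buckets[p] then feeds a per-partition hash table; if a partition still
// turns out too large, it can be re-partitioned recursively on the next group
// of hash bits.
{code}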
Merging and partitioning are somewhat symmetrical. It is not immediately clear
to me that one is better than the other.
So, in summary, the design choices I see are: a) whether the aggregation
results should be represented by a single hash table or by a collection of hash
tables for disjoint hash ranges (my preference right now is the collection,
which seems to give more flexibility in the future), and b) whether to use a
merge-based or a partition-based approach (or neither).
> [C++][Compute] Add support for imperfect grouping for use in radix
> partitioning
> -------------------------------------------------------------------------------
>
> Key: ARROW-12044
> URL: https://issues.apache.org/jira/browse/ARROW-12044
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Ben Kietzman
> Priority: Major
> Labels: query-engine
> Fix For: 6.0.0
>
>
> ARROW-11591 adds Grouper for identifying groups based on multiple key columns.
> For a large number of groups, it is beneficial to do a first pass
> partitioning on the key columns so that each worker thread only handles a
> subset of the query's groups. This is usually accomplished by computing only
> the hashes of the keys (not full group identity) and pushing slices of the
> input batches to workers based on those.
> This would probably make sense as a member function of Grouper, maybe
> Grouper::ConsumeImperfect