[
https://issues.apache.org/jira/browse/IMPALA-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362899#comment-17362899
]
Amogh Margoor commented on IMPALA-7635:
---------------------------------------
Using the benchmark here:
[https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2016/06/25/fastrange.c]
computed the cycles needed to execute modulo (*modsum*), fast modulo using
multiplication (*fastsum*) and using existing technique for bit masking
(*maskedsum*) below:
// N is number of buckets
*N = 31*
*modsum(z,N,accesses,nmbr): 11.96 cycles per operation*
*fastsum(z,N,accesses,nmbr): 2.25 cycles per operation*
*N = 1500*
*modsum(z,N,accesses,nmbr): 11.96 cycles per operation*
*fastsum(z,N,accesses,nmbr): 2.27 cycles per operation*
*N = 15000*
*modsum(z,N,accesses,nmbr): 11.96 cycles per operation*
*fastsum(z,N,accesses,nmbr): 2.27 cycles per operation*
*N = 32*
*modsum(z,N,accesses,nmbr): 11.96 cycles per operation*
*fastsum(z,N,accesses,nmbr): 2.27 cycles per operation*
*maskedsum(z,N,accesses,nmbr): 2.02 cycles per operation*
*N = 4096*
*modsum(z,N,accesses,nmbr): 11.96 cycles per operation*
*fastsum(z,N,accesses,nmbr): 2.27 cycles per operation*
*maskedsum(z,N,accesses,nmbr): 2.02 cycles per operation*
*N = 65536*
*modsum(z,N,accesses,nmbr): 11.96 cycles per operation*
*fastsum(z,N,accesses,nmbr): 2.31 cycles per operation*
*maskedsum(z,N,accesses,nmbr): 2.16 cycles per operation*
Not much difference is observed between the existing technique of masked bit
and fast modulo using multiplication. Hence we can use fast modulo which will
enable us to use number of buckets not power of 2 (which is the constraint for
using bit masking modulo).
> Reduce size of hash tables in-memory by packing buckets more densely
> --------------------------------------------------------------------
>
> Key: IMPALA-7635
> URL: https://issues.apache.org/jira/browse/IMPALA-7635
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 3.1.0
> Reporter: Tim Armstrong
> Assignee: Amogh Margoor
> Priority: Major
> Labels: perf
>
> Currently the hash tables used for hash join and aggregation use 16 bytes per
> bucket and 24 bytes per additional duplicate for a key:
> {code}
> /// Linked list of entries used for duplicates.
> struct DuplicateNode {
> /// Used for full outer and right {outer, anti, semi} joins. Indicates
> whether the
> /// row in the DuplicateNode has been matched.
> /// From an abstraction point of view, this is an awkward place to store
> this
> /// information.
> /// TODO: Fold this flag in the next pointer below.
> bool matched;
> /// Chain to next duplicate node, NULL when end of list.
> DuplicateNode* next;
> HtData htdata;
> };
> struct Bucket {
> /// Whether this bucket contains a vaild entry, or it is empty.
> bool filled;
> /// Used for full outer and right {outer, anti, semi} joins. Indicates
> whether the
> /// row in the bucket has been matched.
> /// From an abstraction point of view, this is an awkward place to store
> this
> /// information but it is efficient. This space is otherwise unused.
> bool matched;
> /// Used in case of duplicates. If true, then the bucketData union should
> be used as
> /// 'duplicates'.
> bool hasDuplicates;
> /// Cache of the hash for data.
> /// TODO: Do we even have to cache the hash value?
> uint32_t hash;
> /// Either the data for this bucket or the linked list of duplicates.
> union {
> HtData htdata;
> DuplicateNode* duplicates;
> } bucketData;
> };
> {code}
> There are some comments in the code that suggest folding the boolean values
> into the upper bits of the pointers (since on amd64 the address space is only
> 48 bits, but moving to 57 bits apparently - see
> https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf).
> That would reduce the bucket to 12 bytes of actual data.
> This would give us the opportunity to reduce memory requirements of joins and
> the pressure on caches significantly, provided we can work out the
> implementation issues and the cost of the bit manipulation doesn't exceed the
> benefit (my intuition is that cache effects are way more important but I
> could be wrong).
> Here's a rough idea of what we could do:
> # Implement folding of booleans into the pointer and mark struct Bucket as
> packed so that it doesn't just undo the work with additional padding.
> # Modifying Hashtable to work with the new bucket structure. This needs a
> little thought since the bucket allocations must be a power-of-two size in
> bytes, but we also need the hash table entries to be a power-of-two in order
> for masking the hash to get the bucket number to work. I think either we
> could just leave wasted space in the buffer or switch to a non-power-of-two
> number of buckets and using an alternative method of getting the bucket from
> the hash:
> https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
> # Run benchmarks to see if it's beneficial. The effect probably depends on
> the data set size.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]