[
https://issues.apache.org/jira/browse/IMPALA-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Armstrong resolved IMPALA-2055.
-----------------------------------
Resolution: Later
> improve performance of HyperLogLog
> ----------------------------------
>
> Key: IMPALA-2055
> URL: https://issues.apache.org/jira/browse/IMPALA-2055
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 2.2
> Reporter: Daniel Lescohier
> Priority: Minor
> Attachments: hllmerge.cxx, hllmerge_clang.s, hllmerge_gcc.s
>
>
> I'm attaching a standalone .cxx file that can be compiled independently, to
> showcase improvements to the HLL functions in
> be/src/exprs/aggregate-functions.cc.
> The two assembler files are output from:
> - clang -S -O3 hllmerge.cxx
> - gcc -S -O3 hllmerge.cxx
> The improvements can be integrated in:
> - AggregateFunctions::HllMerge
> - AggregateFunctions::HllFinalEstimate
> I think HllMerge is probably more important, because a lot more data is put
> through that function.
> For HllMerge, I was able to have the compilers vectorize the code with SSE
> packed byte instructions. In order for the compiler to do it, I changed the
> function so that:
> - The loop limit is against the compile-time constant HLL_LEN. The HllMerge
> function has these checks, so it'd be OK to use the constant:
> -- DCHECK_EQ(dst->len, HLL_LEN);
> -- DCHECK_EQ(src.len, HLL_LEN);
> - Made the loop into a simpler form that the compilers could recognize to
> vectorize.
> The C++ code:
> {code}
> void HllMerge(FunctionContext* ctx, const StringVal& src,
> StringVal *dst)
> {
> const char *s = src.ptr, *end = src.ptr + HLL_LEN;
> char *d = dst->ptr;
> for (; s < end; ++s, ++d) {
> *d = (*s > *d) ? *s: *d;
> }
> }
> {code}
> The inner loop of the merge that clang generates is this:
> {code}
> .LBB0_9: # %vector.body
> # =>This Inner Loop Header: Depth=1
> movdqu -16(%rsi), %xmm0
> movdqu -16(%rdx), %xmm1
> movdqa %xmm0, %xmm2
> pcmpgtb %xmm1, %xmm2
> pand %xmm2, %xmm0
> pandn %xmm1, %xmm2
> por %xmm0, %xmm2
> movdqu %xmm2, -16(%rdx)
> movdqu (%rsi), %xmm0
> movdqu (%rdx), %xmm1
> movdqa %xmm0, %xmm2
> pcmpgtb %xmm1, %xmm2
> pand %xmm2, %xmm0
> pandn %xmm1, %xmm2
> por %xmm0, %xmm2
> movdqu %xmm2, (%rdx)
> addq $32, %rsi
> addq $32, %rdx
> addq $-32, %rcx
> jne .LBB0_9
> {code}
> For HllEstimate, I made two helper functions, which can be static functions
> in aggregate-functions.cc.
> One to count the number of zero registers:
> {code}
> int HllCountZeroRegisters(const char *buckets)
> {
> int sum;
> for (int i = 0; i < HLL_LEN; ++i) {
> sum += buckets[i] ? 0 : 1;
> }
> return sum;
> }
> {code}
> This also uses packed byte instructions.
> The second helper function is to sum the inverse powers of 2. This uses an
> array of 64 floats that has the precomputed inverse powers of 2, to save
> calling the expensive powf function. The table is 256 bytes, so easily fits
> in L1 cache.
> {code}
> float HllSumInvPow2(const char *buckets)
> {
> float sum = 0;
> const uint64_t *i = reinterpret_cast<const uint64_t *>(buckets);
> const uint64_t *end = i + (HLL_LEN/sizeof(end));
> for (; i < end; ++i) {
> sum += invPow2[(*i >> 0) & 63];
> sum += invPow2[(*i >> 8) & 63];
> sum += invPow2[(*i >> 16) & 63];
> sum += invPow2[(*i >> 24) & 63];
> sum += invPow2[(*i >> 32) & 63];
> sum += invPow2[(*i >> 40) & 63];
> sum += invPow2[(*i >> 48) & 63];
> sum += invPow2[(*i >> 56) & 63];
> }
> return sum;
> }
> {code}
> In order to use this function, we should add this check to HllFinalEstimate:
> - DCHECK_EQ(HLL_LEN & 7, 0)
> Reading from L1 cache should be much faster than calling the powf function.
> The movq/shrq/andl instructions are basically free since they're interleaved
> with the latency of the addss instruction adding from L1 cache.
> I recommend that HllFinalEstimate calls the HllCountZeroRegisters function
> first, and only if num_zero_registers == 0, call the HllSumInvPow2 function.
> No need to do the relatively expensive floating point sum if the number of
> distinct values is low.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)