[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Dan Hecht (Code Review) Mon, 13 Jun 2016 14:30:41 -0700

Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................



Patch Set 5:

> > It just occurred to me that this could give incorrect results
 > > running on a mixed cluster of avx2/non-avx2 machines.
 > >
 > > Would it make sense to just use the avx2-optimised layout for the
 > > non-avx2 case?
 > 
 > Good point, Tim!
 > 
 > Using the same layout is certainly possible. Using the same hash
 > functions, however, would slow down the non-avx2 code.
 > 
 > The reason is that, between PS4 and PS5, I started using the vpmulld
 > instruction to rehash the 32-bit value by multiplying it by 8
 > different odd 32-bit constants and taking the top 5 bits of each.
 > In the serial code, I multiply by two different 64-bit constants
 > using Rehash32to64, add other 64-bit constants, then take the top
 > 32 bits of each. Switching to eight 32-bit multiplications
 > would be a good bit slower, I suspect.
 > 
 > This could be alleviated using pmulld, which can perform four 32-bit
 > multiplications with one instruction, but that was added in SSE4.1.
 > 
 > I see two options:
 > 
 > 1. Leave some performance on the table with this commit by moving
 > back to PS4.
 > 
 > 2. Take a regression for pre-sse4.1 machines (ended in 2008ish for
 > Intel, 2012ish for AMD, if I'm reading correctly) and a bigger
 > speedup for more modern machines.
 > 
 > I have another change I've already started testing that increases
 > the gap between #1 and #2 by another 50-100%.
 > 
 > Tim, Dan: what do you think is the right choice?

How much of a regression will pre-SSE4.1 incur?  Regressing that case in favor 
of making SSE4.1 capable 

For #2, what is the rough perf impact on pre-SSE4.1 and SSE4.1 machines before
and after that change?  #2 is probably the right choice, but it would be good
to have some rough estimates to understand the implications.
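
For reference, the rehash step described in the quoted message (multiply the
32-bit hash by eight different odd 32-bit constants and take the top 5 bits of
each product) can be sketched in portable scalar C++ roughly as below. This is
only an illustration under assumptions, not the code in the patch: the
constants, the names MakeMask/Insert/Find, and the bucket layout of eight
32-bit words are all hypothetical. The point is that a non-AVX2 machine
running something like this would set the same bits as a vpmulld version, at
the cost of eight serial multiplications.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical odd multipliers, chosen only for illustration.
constexpr std::array<uint32_t, 8> kRehash = {
    0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
    0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Assumed bucket layout: eight 32-bit words, one bit set per word per key.
using Bucket = std::array<uint32_t, 8>;

// Builds the per-key mask: for each word, multiply the hash by an odd
// constant (wrapping modulo 2^32) and use the top 5 bits as a bit index.
Bucket MakeMask(uint32_t hash) {
  Bucket mask{};
  for (int i = 0; i < 8; ++i) {
    const uint32_t bit = (kRehash[i] * hash) >> 27;  // value in 0..31
    assert(bit < 32);
    mask[i] = uint32_t{1} << bit;
  }
  return mask;
}

// Sets the key's eight bits in the bucket.
void Insert(Bucket& bucket, uint32_t hash) {
  const Bucket mask = MakeMask(hash);
  for (int i = 0; i < 8; ++i) bucket[i] |= mask[i];
}

// Returns true only if all eight of the key's bits are set.
bool Find(const Bucket& bucket, uint32_t hash) {
  const Bucket mask = MakeMask(hash);
  for (int i = 0; i < 8; ++i) {
    if ((bucket[i] & mask[i]) == 0) return false;
  }
  return true;
}
```

An AVX2 build would replace each loop with a single 256-bit operation, and an
SSE4.1 build could handle the eight lanes in two halves of four with pmulld.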

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: No
