Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................

Patch Set 5:

> > It just occurred to me that this could give incorrect results
> > running on a mixed cluster of avx2/non-avx2 machines.
> >
> > Would it make sense to just use the avx2-optimised layout for the
> > non-avx2 case?
>
> Good point, Tim!
>
> Using the same layout is certainly possible. Using the same hash
> functions, however, would slow down the non-avx2 code.
>
> The reason is that, between PS4 and PS5, I started using the vpmulld
> instruction to rehash the 32-bit value by multiplying it by 8
> different odd 32-bit constants and taking the top 5 bits of each.
> In the serial code, I multiply by two different 64-bit constants
> using Rehash32to64, add other 64-bit constants, then take the top
> 32 bits of each. Switching to eight 32-bit multiplications
> would be a good bit slower, I suspect.
>
> This could be alleviated using pmulld, which can perform 4 32-bit
> multiplications with one instruction, but that was added in SSE4.1.
>
> I see two options:
>
> 1. Leave some performance on the table with this commit by moving
> back to PS4.
>
> 2. Take a regression for pre-sse4.1 machines (ended in 2008ish for
> Intel, 2012ish for AMD, if I'm reading correctly) and a bigger
> speedup for more modern machines.
>
> I have another change I've already started testing that increases
> the gap between #1 and #2 by another 50-100%.
>
> Tim, Dan: what do you think is the right choice?

How much of a regression will pre-SSE4.1 incur? Regressing that case in favor of making SSE4.1 capable

For #2, what is the rough perf impact for pre-SSE4.1 and SSE4.1 before and after that change? #2 is probably the right choice, but it would be good to have some rough estimates to understand the implications.

--
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: No
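
For readers following the thread, the rehash step described in the quoted message (multiply the 32-bit hash by eight different odd 32-bit constants and keep the top 5 bits of each product) might look roughly like the scalar sketch below. This is only an illustration of what eight serial 32-bit multiplications would involve on a machine without pmulld/vpmulld. The function name `RehashTop5Bits` and the constant values in `kOddConstants` are assumptions made for this example and are not taken from the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative odd 32-bit multipliers. These values are an assumption for
// this sketch; the thread does not say which constants the patch uses.
constexpr std::array<uint32_t, 8> kOddConstants = {
    0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
    0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Hypothetical helper: rehashes one 32-bit value into eight 5-bit values,
// one per constant. Each result is in the range [0, 31].
inline std::array<uint32_t, 8> RehashTop5Bits(uint32_t hash) {
  std::array<uint32_t, 8> out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    // Unsigned multiplication wraps modulo 2^32, matching what a 32-bit
    // low-half multiply instruction would produce in each lane.
    const uint32_t product = kOddConstants[i] * hash;
    // Keep only the top 5 bits of the 32-bit product.
    out[i] = product >> 27;
  }
  return out;
}
```

The loop body is the work that a single vpmulld (eight lanes) or two pmulld instructions (four lanes each) would do at once, which is why the serial version is expected to be slower on pre-SSE4.1 hardware.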
