Youwei Wang has posted comments on this change.

Change subject: IMPALA-2809: Improve ByteSwap with builtin function or SSE or AVX2.
......................................................................
Patch Set 13: (3 comments)

http://gerrit.cloudera.org:8080/#/c/3081/13/be/src/benchmarks/bswap-benchmark.cc
File be/src/benchmarks/bswap-benchmark.cc:

Line 50: // SIMD  57.73  2.18X
> I think we should pause this code review while you figure out why SIMD is s

Hi Jim. I have rewritten this benchmark; please review the latest patch for more information. If you don't mind, I would like to present some performance data first:

/// ByteSwap benchmark:  Function     Rate (iters/ms)   Comparison
/// ----------------------------------------------------------------------
///                      OldImpala    1.224e+06         1X
///                      FastScala    1.286e+06         1.051X
///                      SSE4.2       1.296e+07         10.59X
///                      AVX2         3.156e+07         25.78X
///                      SIMD         3.043e+07         24.86X

This performance data comes from my latest benchmark. For simplicity: "SSE4.2" refers to the result for the ByteSwapSSE_Unit function in bit-util.inline.h, "AVX2" refers to the result for the ByteSwapAVX_Unit function in bit-util.inline.h, and "SIMD" refers to the interface "inline void BitUtil::ByteSwap(void* dest, const void* source, int len);" in bit-util.inline.h.

As you can see, the performance data is now explainable:
1. FastScala vs. OldImpala: this result is easy to understand.
2. AVX2 vs. SSE4.2: this is also easy to understand, since AVX2 has double the data width of SSE4.2, so we can expect a large performance boost here.
3. SIMD vs. AVX2: the arch-selector branch does cause a relatively small performance loss, but I think it is still acceptable in this context.

Please allow me to explain the two approaches:

1. Zuo Wang's idea is to byteswap a long array element by element. The data type of maximum size is 128 bits, that is, the Decimal16 type. Here is an illustration (D16 abbreviates Decimal16; BS abbreviates Byteswap):

Source:       D16[0] D16[1] D16[2] D16[3] D16[4] D16[5] D16[6] D16[7] ...
Manipulation: AVX2-BS(D16[0] D16[1]) AVX2-BS(D16[2] D16[3]) AVX2-BS(D16[4] D16[5]) AVX2-BS(D16[6] D16[7])
Result:       BS(D16[0]) BS(D16[1]) BS(D16[2]) BS(D16[3]) BS(D16[4]) BS(D16[5]) BS(D16[6]) BS(D16[7])

2. In contrast, my logic deals with a very long array like this:

Source:       Byte1 Byte2 Byte3 Byte4 ... ByteN-1 ByteN
Manipulation: AVX2-BS(Byte1 Byte2 Byte3 Byte4 ... ByteN-1 ByteN)
Result:       ByteN ByteN-1 ... Byte4 Byte3 Byte2 Byte1

My logic is based on my understanding of this interface:
inline void BitUtil::ByteSwap(void* dest, const void* source, int len);
That is, the interface swaps an input byte array without taking its element types into consideration. I am not sure whether my understanding of the interface BitUtil::ByteSwap is correct. If you approve of my logic, should we continue with my latest code, since my new benchmark can now output reasonable performance data? Please feel free to share your thoughts on this.

Line 250: /// For example, if FIXED_LEN_SIZE == 2, we are going to deal with the first
> I think it would be inappropriate to check this code in if Zuo Wang is the

Done

Line 255: /// we can get these constants by following definitions of corresponding AVX2 intrinsics:
> This is way too sparse. Please explain in more detail. Consider using one e

This part of the code is removed in the latest patch.

--
To view, visit http://gerrit.cloudera.org:8080/3081
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I392ed5a8d5683f30f161282c228c1aedd7b648c1
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Youwei Wang <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Youwei Wang <[email protected]>
Gerrit-HasComments: Yes
