Youwei Wang has posted comments on this change.

Change subject: IMPALA-2809: Improve ByteSwap with builtin function or SSE or AVX2.
......................................................................
Patch Set 13: (3 comments)

http://gerrit.cloudera.org:8080/#/c/3081/13/be/src/benchmarks/bswap-benchmark.cc
File be/src/benchmarks/bswap-benchmark.cc:

Line 50: // SIMD  57.73  2.18X
> I think we should pause this code review while you figure out why SIMD is s

Hi Jim. I have rewritten this benchmark; please review the latest patch for more information. If you don't mind, I would like to present some performance data first:

/// ByteSwap benchmark:  Function     Rate (iters/ms)   Comparison
/// ----------------------------------------------------------------------
///                      OldImpala    1.224e+06         1X
///                      FastScala    1.286e+06         1.051X
///                      SSE4.2       1.296e+07         10.59X
///                      AVX2         3.156e+07         25.78X
///                      SIMD         3.043e+07         24.86X

This performance data comes from my latest benchmark. For simplicity: "SSE4.2" refers to the result for the ByteSwapSSE_Unit function in bit-util.inline.h, "AVX2" refers to the result for the ByteSwapAVX_Unit function in bit-util.inline.h, and "SIMD" refers to the interface "inline void BitUtil::ByteSwap(void* dest, const void* source, int len);" in bit-util.inline.h.

As you can see, the performance data is now explainable:
1. FastScala vs. OldImpala: this result is easy to understand.
2. AVX2 vs. SSE4.2: this is also easy to understand, since AVX2 has double the data width of SSE4.2, so we can expect a large performance boost here.
3. SIMD vs. AVX2: the arch-selector branch does cause a relatively small performance loss, but I think it is still acceptable in this context.

Please allow me to explain the two approaches:

1. Zuo Wang's idea is to byteswap a long array element by element. The data type of maximum size is 128 bits, that is, the Decimal16 type. Here is an illustration (D16 abbreviates Decimal16; BS abbreviates Byteswap):

Source:       D16[0] D16[1] D16[2] D16[3] D16[4] D16[5] D16[6] D16[7] ...
Manipulation: AVX2-BS(D16[0] D16[1]) AVX2-BS(D16[2] D16[3]) AVX2-BS(D16[4] D16[5]) AVX2-BS(D16[6] D16[7])
Result:       BS(D16[0]) BS(D16[1]) BS(D16[2]) BS(D16[3]) BS(D16[4]) BS(D16[5]) BS(D16[6]) BS(D16[7])

2. In contrast, my logic deals with a very long array like this:

Source:       Byte1 Byte2 Byte3 Byte4 ... ByteN-1 ByteN
Manipulation: AVX2-BS(Byte1 Byte2 Byte3 Byte4 ... ByteN-1 ByteN)
Result:       ByteN ByteN-1 ... Byte4 Byte3 Byte2 Byte1

My logic is based on my understanding of this interface:
inline void BitUtil::ByteSwap(void* dest, const void* source, int len);
That is, the interface swaps an input byte array without taking its element types into consideration. I am not sure whether my understanding of the interface BitUtil::ByteSwap is correct. If you approve of my logic, should we continue with my latest code, since my new benchmark can now output reasonable performance data? Please feel free to share your thoughts on this.

Line 250: /// For example, if FIXED_LEN_SIZE == 2, we are going to deal with the first
> I think it would be inappropriate to check this code in if Zuo Wang is the

Done

Line 255: /// we can get these constants by following definitions of corresponding AVX2 intrinsics:
> This is way too sparse. Please explain in more detail. Consider using one e

This part of the code is removed in the latest patch.

--
To view, visit http://gerrit.cloudera.org:8080/3081
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I392ed5a8d5683f30f161282c228c1aedd7b648c1
Gerrit-PatchSet: 13
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Youwei Wang <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Youwei Wang <[email protected]>
Gerrit-HasComments: Yes
