[
https://issues.apache.org/jira/browse/ORC-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gang Wu updated ORC-1356:
-------------------------
Fix Version/s: 1.9.0
> Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode
> -----------------------------------------------------------------------
>
> Key: ORC-1356
> URL: https://issues.apache.org/jira/browse/ORC-1356
> Project: ORC
> Issue Type: Improvement
> Components: C++, ORCv2, RLE
> Affects Versions: master
> Reporter: Peng Wang
> Assignee: Peng Wang
> Priority: Major
> Fix For: 1.9.0
>
>
> The original ORC RLE bit-packing decodes values one by one. Intel AVX-512
> brings 512-bit vector operations to the bit-packing decode process, so far
> fewer CPU instructions are needed to unpack the same amount of data, which
> makes the AVX-512 vector decode much faster than before. In a functional
> micro-benchmark, the new vector functions unrolledUnpackVectorN showed an
> average 6x~7x latency improvement over the original bit-packing decode
> function plainUnpackLongs. In the real world, users store large datasets in
> the ORC format and need to decode hundreds or thousands of bytes at a time,
> so the AVX-512 vector decode makes this processing significantly more
> efficient.
>
> In practice, values stored in ORC are usually narrower than 32 bits, so this
> PR supplies vector code paths for value widths below 32 bits. For values
> whose width is a multiple of 8 bits (8-bit, 16-bit, and so on), the
> performance improvement is relatively small compared with the
> non-byte-aligned widths.
>
> Intel AVX-512 intrinsics official reference:
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
>
> 1. Added a CMake option named "ENABLE_AVX512_BIT_PACKING" to enable or
> disable this feature at build time. Its default value is OFF.
> For example: cmake .. -DCMAKE_CXX_FLAGS="-mavx512vbmi -march=native"
> -DCMAKE_BUILD_TYPE=debug -DBUILD_JAVA=OFF -DENABLE_AVX512_BIT_PACKING=ON
> -DSNAPPY_HOME=/usr/local
> 2. Added a macro "ENABLE_AVX512" that guards whether this feature's code is
> compiled into ORC.
> 3. Added a function "detect_platform" to dynamically detect whether the
> current platform supports AVX-512. When ORC is built with AVX-512 enabled
> but the platform it runs on does not support AVX-512, it falls back to the
> original bit-packing decode functions instead of the AVX-512 vector decode.
> 4. Added functions "unrolledUnpackVectorN" that decode N-bit values in place
> of the original functions plainUnpackLongs and unrolledUnpackN.
> 5. Added test cases "RleV2_basic_vector_decode_Nbit" in the new test file
> TestRleVectorDecoder.cc to verify the N-bit AVX-512 vector decode.
> 6. Modified the function plainUnpackLongs to take an output parameter
> uint64_t& startBit, which stores the number of bits left over after
> unpacking.
> 7. The AVX-512 vector decode processes 512 bits of data per unpacking step,
> so when the run being decoded is long enough, almost all of the data can be
> processed by AVX-512. If the data length (or block size) is shorter than 512
> bits, AVX-512 is not used and decoding falls back to the original one-by-one
> unpacking.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)