[
https://issues.apache.org/jira/browse/ORC-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gang Wu updated ORC-1356:
-------------------------
Fix Version/s: 1.9.0
> Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode
> -----------------------------------------------------------------------
>
> Key: ORC-1356
> URL: https://issues.apache.org/jira/browse/ORC-1356
> Project: ORC
> Issue Type: Improvement
> Components: C++, ORCv2, RLE
> Affects Versions: master
> Reporter: Peng Wang
> Assignee: Peng Wang
> Priority: Major
> Fix For: 1.9.0
>
>
> The original ORC RLE bit-packing decodes values one by one. Intel AVX-512
> brings 512-bit vector operations to the bit-packing decode process, so far
> fewer CPU instructions are needed to unpack the same amount of data, which
> makes the AVX-512 vector decode much faster than before. In a functional
> micro-benchmark, the new vector functions unrolledUnpackVectorN showed an
> average 6x~7x latency improvement over the original bit-packing decode
> function plainUnpackLongs. In the real world, users store large datasets in
> the ORC format and need to decode hundreds or thousands of bytes at a time,
> so the AVX-512 vector decode makes this processing significantly more
> efficient.
>
> In practice, values stored in ORC are usually narrower than 32 bits, so this
> PR supplies vector code paths for value widths below 32 bits. For values
> whose width is a multiple of 8 bits (8-bit, 16-bit, and so on), the
> performance improvement is relatively small compared with the
> non-byte-aligned widths.
>
> Intel AVX-512 intrinsics official reference:
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
>
> 1. Added a CMake option named "ENABLE_AVX512_BIT_PACKING" to enable or
> disable this feature at build time. Its default value is OFF.
> For example: cmake .. -DCMAKE_CXX_FLAGS="-mavx512vbmi -march=native"
> -DCMAKE_BUILD_TYPE=debug -DBUILD_JAVA=OFF -DENABLE_AVX512_BIT_PACKING=ON
> -DSNAPPY_HOME=/usr/local
> 2. Added a macro "ENABLE_AVX512" that guards whether this feature's code is
> compiled into ORC.
> 3. Added a function "detect_platform" to dynamically detect whether the
> current platform supports AVX-512. When ORC is built with AVX-512 enabled
> but the platform it runs on does not support AVX-512, it falls back to the
> original bit-packing decode functions instead of the AVX-512 vector decode.
> 4. Added functions "unrolledUnpackVectorN" that decode N-bit values in place
> of the original functions plainUnpackLongs and unrolledUnpackN.
> 5. Added test cases "RleV2_basic_vector_decode_Nbit" in the new test file
> TestRleVectorDecoder.cc to verify the N-bit AVX-512 vector decode.
> 6. Modified the function plainUnpackLongs to take an output parameter
> uint64_t& startBit, which stores the number of bits left over after
> unpacking.
> 7. The AVX-512 vector decode processes 512 bits of data per unpacking step,
> so when the run being decoded is long enough, almost all of the data can be
> processed by AVX-512. If the data length (or block size) is shorter than 512
> bits, AVX-512 is not used and decoding falls back to the original one-by-one
> unpacking.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)