[GitHub] [orc] wpleonardo opened a new pull request, #1375: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode

GitBox Tue, 10 Jan 2023 22:50:24 -0800


wpleonardo opened a new pull request, #1375:
URL: https://github.com/apache/orc/pull/1375


   ### What changes were proposed in this pull request?
   The original ORC Rle-bit-packing decoder unpacks values one at a time. Intel 
AVX-512 provides 512-bit vector operations that can accelerate this decode 
process: far fewer CPU instructions are needed to unpack the same amount of 
data, so AVX-512 vector decode performs much better than the existing code. In 
functional micro-benchmarks, I estimate that AVX-512 vector decode brings an 
average 6x to 7x latency improvement when comparing the vector function 
unrolledUnpackVectorN with the original Rle-bit-packing decode function 
plainUnpackLongs. In real-world use, users store large amounts of data in the 
ORC format and need to decode hundreds or thousands of bytes at a time, so 
AVX-512 vector decode will be more efficient and help speed up this processing.
   
   In practice, values stored in ORC are usually at most 32 bits wide, so this 
PR provides vectorized decoding for bit widths up to 32 bits. For 8-bit, 
16-bit, and other widths that are a multiple of 8 bits, the performance 
improvement is relatively small compared with widths that are not a multiple 
of 8.
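
As a point of reference for the one-by-one decode described above, here is a minimal sketch of scalar bit-unpacking. It assumes values are packed MSB-first (big-endian), and `unpackScalar` is a hypothetical helper written for illustration; it is not ORC's plainUnpackLongs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of ORC): unpack `count` values of
// `bitWidth` bits each from `in`, assuming MSB-first bit order.
std::vector<uint64_t> unpackScalar(const std::vector<uint8_t>& in,
                                   uint32_t bitWidth, size_t count) {
  std::vector<uint64_t> out;
  out.reserve(count);
  size_t bitPos = 0;  // absolute bit offset into `in`
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    // One bit per iteration: this per-bit, per-value work is what a
    // vectorized decoder tries to avoid.
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      const uint64_t bit = (in[bitPos / 8] >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(value);
  }
  return out;
}
```

With a 3-bit width, most values straddle a byte boundary, which is why widths that are not a multiple of 8 are the more expensive case for a scalar loop like this one.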
   
   Intel AVX512 instructions official link:
   https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
   
   1. Added a cmake option named "ENABLE_AVX512_BIT_PACKING" to turn this 
feature on or off at build time.
   The default value of ENABLE_AVX512_BIT_PACKING is OFF.
   For example, cmake .. -DCMAKE_CXX_FLAGS="-mavx512vbmi -march=native" 
-DCMAKE_BUILD_TYPE=debug -DBUILD_JAVA=OFF -DENABLE_AVX512_BIT_PACKING=ON 
-DSNAPPY_HOME=/usr/local
   2. Added the macro "ENABLE_AVX512" to control whether this feature's code is 
compiled into ORC.
   3. Added the function "detect_platform" to dynamically detect whether the 
current platform supports AVX-512. If ORC is built with AVX-512 enabled but the 
platform it runs on does not support AVX-512, it uses the original bit-packing 
decode function instead of AVX-512 vector decode.
   4. Added the functions "unrolledUnpackVectorN" to decode N-bit values in 
place of the original functions plainUnpackLongs or unrolledUnpackN.
   5. Added the test cases "RleV2_basic_vector_decode_Nbit" to verify N-bit 
AVX-512 vector decode, in the new test file TestRleVectorDecoder.cc.
   6. Modified the function plainUnpackLongs by adding an output parameter 
uint64_t& startBit. This parameter stores the number of bits left over after 
unpacking.
   7. AVX-512 vector decode processes 512 bits of data per unpacking step. If 
the data being unpacked is long enough, almost all of it can be processed by 
AVX-512. If the data length (or block size) is shorter than 512 bits, AVX-512 
is not used, and decoding falls back to the original path that unpacks values 
one by one.
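
The fallback rules in items 3 and 7 can be sketched as a small planning function. This is an illustration under stated assumptions, not the PR's code: `DecodePlan` and `planDecode` are hypothetical names, the AVX-512 capability is passed in as a plain bool rather than detected, and the split into whole 512-bit chunks ignores any alignment handling the real implementation may need.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: how many values go to each decode path.
struct DecodePlan {
  size_t vectorValues;  // values handled by the 512-bit vector path
  size_t scalarValues;  // values left for the one-by-one path
};

// Hypothetical planner. `bitWidth` must be at least 1.
// `hasAvx512` stands in for the result of a runtime platform check.
DecodePlan planDecode(size_t numValues, uint32_t bitWidth, bool hasAvx512) {
  constexpr size_t kVectorBits = 512;
  const size_t totalBits = numValues * bitWidth;
  // No AVX-512, or less than one full 512-bit chunk: scalar path only.
  if (!hasAvx512 || totalBits < kVectorBits) {
    return {0, numValues};
  }
  // Count the values that fit entirely inside the whole 512-bit chunks;
  // the remainder is decoded one by one.
  const size_t chunks = totalBits / kVectorBits;
  const size_t vectorValues = chunks * kVectorBits / bitWidth;
  return {vectorValues, numValues - vectorValues};
}
```

For instance, 100 values of 10 bits (1000 bits) contain one full 512-bit chunk, so under these assumptions 51 values would take the vector path and 49 the scalar path, while 40 values of 10 bits (400 bits) would be decoded entirely by the scalar path.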
   
   New files added:

   New Files | File Purpose
   -- | --
   DetectPlatform.hh | Dynamically detects whether the current platform supports AVX-512. If it does, AVX-512 vector decode is used; if not, the original decode functions are still used.
   VectorDecoder.hh | Contains the new macros, arrays, and unions that AVX-512 vector decode needs.
   TestRleVectorDecoder.cc | New test cases for unit and functional testing of this feature.
   
   
   
   
   ### Why are the changes needed?
   This improves the performance of Rle-bit-packing decode. In functional 
micro-benchmarks, I estimate that AVX-512 vector decode brings an average 6x to 
7x latency improvement when comparing the vector function 
unrolledUnpackVectorN with the original Rle-bit-packing decode function 
plainUnpackLongs.
   Intel improves CPU performance every year, and users increasingly run data 
analysis on ORC data on newer platforms. The Intel SKX platform already 
supported AVX-512 instructions six years ago. Updating ORC data unpacking to 
use this widely available CPU feature keeps ORC in step with current hardware.
   
   
   ### How was this patch tested?
   I created a new test file, TestRleVectorDecoder.cc, containing the test 
cases listed below. To run them, enable the cmake option 
-DENABLE_AVX512_BIT_PACKING=ON and run on a platform that supports AVX-512. 
Every test case covers 2 scenarios:
   1. The blockSize increases from 1 to 10000, and the data length is 10240;
   2. The blockSize increases from 1000 to 10000, and the data length increases 
from 1000 to 70000.
   Each test case takes a while to run, so I added a progress bar to every 
test case.
   Here is sample progress bar output from one test case:
   [ RUN      ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
   10bit Test 1st Part:100% [##################################################] [10000/10000]
   10bit Test 2nd Part:100% [##################################################] [10000/10000]
   For the main vector function unrolledUnpackVectorN, test code coverage 
reaches 100%.
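
The shape of scenario 1 (one fixed data length, decoded with every block size in a range) can be sketched as a round-trip check. This is a simplified illustration, not the code in TestRleVectorDecoder.cc: `packBits`, `unpackBlock`, and `roundTripAllBlockSizes` are hypothetical helpers, the bit order is assumed to be MSB-first, and the decoder here is scalar. The `bitPos` argument plays a role similar to the leftover-bit bookkeeping that lets a block resume in the middle of a byte.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical MSB-first packer, used only to build test input.
std::vector<uint8_t> packBits(const std::vector<uint64_t>& values,
                              uint32_t bitWidth) {
  std::vector<uint8_t> out((values.size() * bitWidth + 7) / 8, 0);
  size_t bitPos = 0;
  for (uint64_t v : values) {
    for (uint32_t b = bitWidth; b-- > 0; ++bitPos) {
      if ((v >> b) & 1u) {
        out[bitPos / 8] |= static_cast<uint8_t>(0x80u >> (bitPos % 8));
      }
    }
  }
  return out;
}

// Hypothetical block decoder: appends `count` values to `out`, starting at
// bit offset `bitPos`, and advances `bitPos` so the next block can resume
// where this one stopped.
void unpackBlock(const std::vector<uint8_t>& in, uint32_t bitWidth,
                 size_t count, size_t& bitPos, std::vector<uint64_t>& out) {
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      value = (value << 1) | ((in[bitPos / 8] >> (7 - bitPos % 8)) & 1u);
    }
    out.push_back(value);
  }
}

// Decode the same packed data with every block size in [1, maxBlock] and
// check that each run reproduces the original values.
bool roundTripAllBlockSizes(uint32_t bitWidth, size_t dataLength,
                            size_t maxBlock) {
  const uint64_t mask = (bitWidth == 64) ? ~0ULL : ((1ULL << bitWidth) - 1);
  std::vector<uint64_t> values(dataLength);
  for (size_t i = 0; i < dataLength; ++i) {
    values[i] = (i * 2654435761ULL) & mask;  // arbitrary deterministic data
  }
  const std::vector<uint8_t> packed = packBits(values, bitWidth);
  for (size_t blockSize = 1; blockSize <= maxBlock; ++blockSize) {
    std::vector<uint64_t> decoded;
    size_t bitPos = 0;
    for (size_t done = 0; done < dataLength; done += blockSize) {
      unpackBlock(packed, bitWidth, std::min(blockSize, dataLength - done),
                  bitPos, decoded);
    }
    if (decoded != values) return false;
  }
  return true;
}
```

Sweeping the block size matters because different block sizes end at different bit offsets, which exercises the boundaries where a decoder hands off between blocks.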
   
   
   New Testcases | Test Data Bit Size
   -- | --
   RleV2_basic_vector_decode_1bit | 1bit
   RleV2_basic_vector_decode_2bit | 2bit
   RleV2_basic_vector_decode_3bit | 3bit
   RleV2_basic_vector_decode_4bit | 4bit
   RleV2_basic_vector_decode_5bit | 5bit
   RleV2_basic_vector_decode_6bit | 6bit
   RleV2_basic_vector_decode_7bit | 7bit
   RleV2_basic_vector_decode_9bit | 9bit
   RleV2_basic_vector_decode_10bit | 10bit
   RleV2_basic_vector_decode_11bit | 11bit
   RleV2_basic_vector_decode_12bit | 12bit
   RleV2_basic_vector_decode_13bit | 13bit
   RleV2_basic_vector_decode_14bit | 14bit
   RleV2_basic_vector_decode_15bit | 15bit
   RleV2_basic_vector_decode_16bit | 16bit
   RleV2_basic_vector_decode_17bit | 17bit
   RleV2_basic_vector_decode_18bit | 18bit
   RleV2_basic_vector_decode_19bit | 19bit
   RleV2_basic_vector_decode_20bit | 20bit
   RleV2_basic_vector_decode_21bit | 21bit
   RleV2_basic_vector_decode_22bit | 22bit
   RleV2_basic_vector_decode_23bit | 23bit
   RleV2_basic_vector_decode_24bit | 24bit
   RleV2_basic_vector_decode_26bit | 26bit
   RleV2_basic_vector_decode_28bit | 28bit
   RleV2_basic_vector_decode_30bit | 30bit
   RleV2_basic_vector_decode_32bit | 32bit
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
