wpleonardo opened a new pull request, #1375: URL: https://github.com/apache/orc/pull/1375
### What changes were proposed in this pull request?

The original ORC Rle-bit-packing decoder decodes values one at a time. Intel AVX-512 provides 512-bit vector operations that can accelerate this decode process: far fewer CPU instructions are needed to unpack the same amount of data, so AVX-512 vector decoding is considerably faster than the original path.

In a functional micro-benchmark comparing the vector function unrolledUnpackVectorN with the original Rle-bit-packing decode function plainUnpackLongs, I estimate that AVX-512 vector decoding improves latency by 6X ~ 7X on average.

In practice, users store large amounts of data in the ORC format and need to decode hundreds or thousands of bytes at a time, which is where AVX-512 vector decoding helps most. Values in ORC are usually no wider than 32 bits, so this PR provides vector code for bit widths up to 32 bits. For values whose width is a multiple of 8 bits (8 bits, 16 bits, and so on), the improvement is smaller than for the other bit widths.

Intel AVX512 instructions official link: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

1. Added a cmake option named "ENABLE_AVX512_BIT_PACKING" to enable or disable this feature at build time. The default value of ENABLE_AVX512_BIT_PACKING is OFF. For example: `cmake .. -DCMAKE_CXX_FLAGS="-mavx512vbmi -march=native" -DCMAKE_BUILD_TYPE=debug -DBUILD_JAVA=OFF -DENABLE_AVX512_BIT_PACKING=ON -DSNAPPY_HOME=/usr/local`
2. Added the macro "ENABLE_AVX512", which controls whether the feature code is built into ORC.
3. Added the function "detect_platform" to dynamically detect whether the current platform supports AVX-512. When ORC is built with AVX-512 enabled but runs on a platform that does not support AVX-512, it uses the original bit-packing decode functions instead of the AVX-512 vector decode.
4. Added the functions "unrolledUnpackVectorN" to decode N-bit values in place of the original functions plainUnpackLongs and unrolledUnpackN.
5. Added the testcases "RleV2_basic_vector_decode_Nbit" to verify N-bit AVX-512 vector decoding, in the new testcase file TestRleVectorDecoder.cc.
6. Modified the function plainUnpackLongs by adding an output parameter uint64_t& startBit. This parameter stores the number of bits left over after unpacking.
7. The AVX-512 vector decode processes 512 bits of data per unpacking step. If the data to unpack is long enough, almost all of it can be processed with AVX-512. If the data length (or block size) is shorter than 512 bits, AVX-512 is not used and the decoder falls back to the original path, which unpacks values one by one.

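The split described in item 7, together with the fallback in item 3, can be illustrated with a small standalone sketch. This is not the code from this PR: `unpackScalar`, `splitWork`, and `WorkSplit` are hypothetical names, the MSB-first bit order is an assumption of the sketch, and the rule "only values that lie entirely inside whole 512-bit chunks go to the vector path" is a simplification chosen for illustration. No AVX-512 intrinsics are used; the sketch only shows the one-by-one decode and how the work might be divided.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not ORC's API): decode `count` values of `bitWidth`
// bits each, one value at a time, reading bits MSB-first.
std::vector<uint64_t> unpackScalar(const std::vector<uint8_t>& in,
                                   uint32_t bitWidth, size_t count) {
  assert(bitWidth >= 1 && bitWidth <= 64);
  assert(in.size() * 8 >= count * bitWidth);
  std::vector<uint64_t> out;
  out.reserve(count);
  size_t bitPos = 0;  // absolute bit offset into `in`
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      uint64_t bit = (in[bitPos / 8] >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(value);
  }
  return out;
}

// Hypothetical result type: how many values each path would handle.
struct WorkSplit {
  size_t vectorValues;  // values covered by whole 512-bit chunks
  size_t scalarValues;  // remaining values, decoded one by one
};

// Assumed policy for this sketch: use the vector path only when the platform
// supports AVX-512 and at least 512 bits of packed data are available.
WorkSplit splitWork(size_t count, uint32_t bitWidth, bool hasAvx512) {
  assert(bitWidth >= 1 && bitWidth <= 64);
  const size_t kVectorBits = 512;
  const size_t totalBits = count * bitWidth;
  if (!hasAvx512 || totalBits < kVectorBits) {
    return {0, count};  // fall back to the original one-by-one decode
  }
  const size_t chunks = totalBits / kVectorBits;
  const size_t vectorValues = chunks * kVectorBits / bitWidth;
  return {vectorValues, count - vectorValues};
}
```

For example, `splitWork(1000, 3, true)` covers 3000 bits, of which five whole 512-bit chunks hold 853 complete 3-bit values, leaving 147 values for the scalar path. On GCC or Clang for x86, the `hasAvx512` flag could plausibly come from `__builtin_cpu_supports("avx512f")`; how detect_platform obtains it in this PR is not shown here.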
Add new files:

New Files | File Purpose
-- | --
DetectPlatform.hh | Dynamically detects whether the current platform supports AVX-512. If it does, the AVX-512 vector decode is used; if not, the original decode functions are used.
VectorDecoder.hh | Contains the new macros, arrays, and unions that the AVX-512 vector decode needs.
TestRleVectorDecoder.cc | New testcases for unit and functional testing of this feature.

### Why are the changes needed?

This can improve the performance of Rle-bit-packing decode.
In a functional micro-benchmark comparing the vector function unrolledUnpackVectorN with the original Rle-bit-packing decode function plainUnpackLongs, I estimate that AVX-512 vector decoding improves latency by 6X ~ 7X on average.

Intel improves CPU performance every year, and users increasingly analyze ORC data on newer platforms. AVX512 instructions have been supported since the Intel SKX platform, 6 years ago. ORC's data unpacking should therefore take advantage of this widely available CPU feature so that ORC keeps pace with the hardware.

### How was this patch tested?

I created a new testcase file, TestRleVectorDecoder.cc. It contains the testcases listed below. To run them, enable the cmake option -DENABLE_AVX512_BIT_PACKING=ON and run on a platform that supports AVX-512.

Every testcase contains 2 scenarios:

1. The blockSize increases from 1 to 10000, and the data length is 10240.
2. The blockSize increases from 1000 to 10000, and the data length increases from 1000 to 70000.

The testcases take a while to execute, so I added a progress bar for every testcase. Here is a demo of the progress bar output for one testcase:

    [ RUN ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
    10bit Test 1st Part:100% [##################################################] [10000/10000]
    10bit Test 2nd Part:100% [##################################################] [10000/10000]

For the main vector function unrolledUnpackVectorN, test code coverage reaches 100%.

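The idea behind scenario 1 (a fixed data length decoded in blocks of varying size) can be sketched as a round-trip check. This is only an illustration, not the code in TestRleVectorDecoder.cc: `packBits`, `unpackBlock`, and `roundTrip` are hypothetical helpers, the MSB-first bit order is an assumption, and the scalar `unpackBlock` stands in for whatever decoder is under test.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pack each value MSB-first using `bitWidth` bits.
std::vector<uint8_t> packBits(const std::vector<uint64_t>& values,
                              uint32_t bitWidth) {
  std::vector<uint8_t> out((values.size() * bitWidth + 7) / 8, 0);
  size_t bitPos = 0;
  for (uint64_t v : values) {
    for (uint32_t b = bitWidth; b-- > 0; ++bitPos) {
      if ((v >> b) & 1u) {
        out[bitPos / 8] |= static_cast<uint8_t>(0x80u >> (bitPos % 8));
      }
    }
  }
  return out;
}

// Hypothetical stand-in for the decoder under test: appends `count` values
// to `out`, starting at bit offset `bitPos` and advancing it.
void unpackBlock(const std::vector<uint8_t>& in, uint32_t bitWidth,
                 size_t count, size_t& bitPos, std::vector<uint64_t>& out) {
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      value = (value << 1) | ((in[bitPos / 8] >> (7 - bitPos % 8)) & 1u);
    }
    out.push_back(value);
  }
}

// Pack `dataLength` generated values, decode them in blocks of `blockSize`,
// and report whether the decoded values match the originals.
bool roundTrip(uint32_t bitWidth, size_t dataLength, size_t blockSize) {
  const uint64_t mask = (bitWidth >= 64) ? ~0ULL : ((1ULL << bitWidth) - 1);
  std::vector<uint64_t> values(dataLength);
  for (size_t i = 0; i < dataLength; ++i) {
    values[i] = (i * 2654435761ULL) & mask;  // arbitrary deterministic data
  }
  const std::vector<uint8_t> packed = packBits(values, bitWidth);
  std::vector<uint64_t> decoded;
  size_t bitPos = 0;
  for (size_t done = 0; done < dataLength; done += blockSize) {
    const size_t n = std::min(blockSize, dataLength - done);
    unpackBlock(packed, bitWidth, n, bitPos, decoded);
  }
  return decoded == values;
}
```

A driver in the spirit of scenario 1 would call `roundTrip(10, 10240, blockSize)` for each blockSize from 1 to 10000 and fail on the first mismatch. Carrying `bitPos` across blocks matters because a block boundary can fall in the middle of a byte when the bit width is not a multiple of 8.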
New Testcases | Test Data Bit Size
-- | --
RleV2_basic_vector_decode_1bit | 1bit
RleV2_basic_vector_decode_2bit | 2bit
RleV2_basic_vector_decode_3bit | 3bit
RleV2_basic_vector_decode_4bit | 4bit
RleV2_basic_vector_decode_5bit | 5bit
RleV2_basic_vector_decode_6bit | 6bit
RleV2_basic_vector_decode_7bit | 7bit
RleV2_basic_vector_decode_9bit | 9bit
RleV2_basic_vector_decode_10bit | 10bit
RleV2_basic_vector_decode_11bit | 11bit
RleV2_basic_vector_decode_12bit | 12bit
RleV2_basic_vector_decode_13bit | 13bit
RleV2_basic_vector_decode_14bit | 14bit
RleV2_basic_vector_decode_15bit | 15bit
RleV2_basic_vector_decode_16bit | 16bit
RleV2_basic_vector_decode_17bit | 17bit
RleV2_basic_vector_decode_18bit | 18bit
RleV2_basic_vector_decode_19bit | 19bit
RleV2_basic_vector_decode_20bit | 20bit
RleV2_basic_vector_decode_21bit | 21bit
RleV2_basic_vector_decode_22bit | 22bit
RleV2_basic_vector_decode_23bit | 23bit
RleV2_basic_vector_decode_24bit | 24bit
RleV2_basic_vector_decode_26bit | 26bit
RleV2_basic_vector_decode_28bit | 28bit
RleV2_basic_vector_decode_30bit | 30bit
RleV2_basic_vector_decode_32bit | 32bit

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
