niyue commented on issue #40024: URL: https://github.com/apache/arrow/issues/40024#issuecomment-1937460521
I added several micro benchmarks to measure expression compilation performance (the existing micro benchmarks focus primarily on execution performance rather than compilation performance).

# All micro benchmarks

[Figure: chart of all micro benchmarks]

* The first 6 micro benchmarks measure compilation performance.
* This PR is not expected to change the execution performance of generated code, so apart from the first 6, the remaining benchmarks are almost unchanged.

# The first 6 benchmarks

[Figure: chart of the first 6 benchmarks]

# The first 6 benchmarks (log scale)

[Figure: chart of the first 6 benchmarks, log scale]

# The detailed benchmark stats

### before optimization

```
2024-02-11T15:09:54+08:00
Running release/gandiva-micro-benchmarks
Run on (10 X 24.1211 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 2.59, 5.07, 4.75
/Users/ss/dev/projects/opensource/arrow/cpp/src/gandiva/cache.cc:50: Creating gandiva cache with capacity of 500
/Users/ss/dev/projects/opensource/arrow/cpp/src/gandiva/engine.cc:276: Detected CPU Name : apple-m1
/Users/ss/dev/projects/opensource/arrow/cpp/src/gandiva/engine.cc:277: Detected CPU Features: []
-------------------------------------------------------------------------------
Benchmark                                          Time         CPU  Iterations
-------------------------------------------------------------------------------
TimedTestExprCompilationNoCache                14760 us    14759 us          39
TimedTestExprCompilationWithCache                227 us      226 us        3094
TimedTestNonBitcodeExprCompilationNoCache      13051 us    13047 us          46
TimedTestNonBitcodeExprCompilationWithCache      238 us      238 us        2916
TimedTestLiteralExprCompilationNoCache           227 us      227 us        2990
TimedTestLiteralExprCompilationWithCache         230 us      230 us        3034
TimedTestAdd3                                   1134 us     1128 us         635
TimedTestBigNested                              7856 us     7854 us          88
TimedTestExtractYear                            7183 us     7173 us          98
TimedTestFilterAdd2                             2828 us     2828 us         249
TimedTestFilterLike                            12836 us    12833 us          55
TimedTestCastFloatFromString                   14497 us    14495 us          48
TimedTestCastIntFromString                     14271 us    14271 us          49
TimedTestAllocs                                34164 us    34164 us          21
TimedTestOutputStringAllocs                    51252 us    51230 us          14
TimedTestMultiOr                                9022 us     9022 us          78
DecimalAdd2Fast                                 2054 us     2048 us         348
DecimalAdd2LeadingZeroes                        5060 us     5059 us         138
DecimalAdd2LeadingZeroesWithDiv                23955 us    23948 us          29
DecimalAdd2Large                              118613 us   118586 us           6
DecimalAdd3Fast                                 2340 us     2332 us         304
DecimalAdd3LeadingZeroes                        8752 us     8751 us          79
DecimalAdd3LeadingZeroesWithDiv                60829 us    60811 us          11
DecimalAdd3Large                              241113 us   241100 us           3
```

### after optimization

```
2024-02-11T15:11:43+08:00
Running release/gandiva-micro-benchmarks
Run on (10 X 24.1228 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 2.83, 4.38, 4.51
/Users/ss/dev/projects/opensource/arrow/cpp/src/gandiva/cache.cc:50: Creating gandiva cache with capacity of 500
/Users/ss/dev/projects/opensource/arrow/cpp/src/gandiva/engine.cc:273: Detected CPU Name : apple-m1
/Users/ss/dev/projects/opensource/arrow/cpp/src/gandiva/engine.cc:274: Detected CPU Features: []
-------------------------------------------------------------------------------
Benchmark                                          Time         CPU  Iterations
-------------------------------------------------------------------------------
TimedTestExprCompilationNoCache                14382 us    14380 us          39
TimedTestExprCompilationWithCache               82.4 us     82.4 us        8394
TimedTestNonBitcodeExprCompilationNoCache       1255 us     1255 us         499
TimedTestNonBitcodeExprCompilationWithCache     90.6 us     90.6 us        7689
TimedTestLiteralExprCompilationNoCache          82.1 us     82.1 us        8528
TimedTestLiteralExprCompilationWithCache        85.6 us     85.6 us        8167
TimedTestAdd3                                   1140 us     1133 us         599
TimedTestBigNested                              7818 us     7817 us          89
TimedTestExtractYear                            7187 us     7184 us          98
TimedTestFilterAdd2                             2809 us     2809 us         249
TimedTestFilterLike                            13097 us    13093 us          54
TimedTestCastFloatFromString                   14168 us    14168 us          49
TimedTestCastIntFromString                     14164 us    14159 us          49
TimedTestAllocs                                33802 us    33802 us          21
TimedTestOutputStringAllocs                    50598 us    50592 us          13
TimedTestMultiOr                               11379 us    11378 us          63
TimedTestInExpr                                 2509 us     2509 us         273
DecimalAdd2Fast                                 2029 us     2029 us         340
DecimalAdd2LeadingZeroes                        5153 us     5151 us         135
DecimalAdd2LeadingZeroesWithDiv                24197 us    24164 us          29
DecimalAdd2Large                              118994 us   118917 us           6
DecimalAdd3Fast                                 2281 us     2280 us         295
DecimalAdd3LeadingZeroes                        8937 us     8935 us          78
DecimalAdd3LeadingZeroesWithDiv                60969 us    60966 us          11
DecimalAdd3Large                              241916 us   241723 us           3
```

# Conclusion

* `TimedTestExprCompilationNoCache` is only slightly faster (around 2-3%): compilation itself is faster, but the execution time still dominates this benchmark.
* `TimedTestExprCompilationWithCache`, `TimedTestNonBitcodeExprCompilationWithCache` and `TimedTestLiteralExprCompilationWithCache` are faster primarily because we avoid loading the IR and C functions when the cache is hit. They are roughly 2.6x to 2.8x faster.
* `TimedTestNonBitcodeExprCompilationNoCache` and `TimedTestLiteralExprCompilationNoCache` are about 10x and 2.8x faster, respectively. For use cases where only C functions are used, such as `random()`, compilation should be much faster since the LLVM bitcode no longer needs to be loaded and linked.
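The speedup figures in the conclusion can be rechecked from the first six rows of the two benchmark tables. The following is a minimal Python sketch, assuming the `Time` column is the right quantity to compare; the dictionaries and the `speedup` helper are illustrative names and not part of Gandiva or its benchmark suite.

```python
# Values of the "Time" column in microseconds, copied from the
# "before optimization" and "after optimization" benchmark runs.
before_us = {
    "TimedTestExprCompilationNoCache": 14760,
    "TimedTestExprCompilationWithCache": 227,
    "TimedTestNonBitcodeExprCompilationNoCache": 13051,
    "TimedTestNonBitcodeExprCompilationWithCache": 238,
    "TimedTestLiteralExprCompilationNoCache": 227,
    "TimedTestLiteralExprCompilationWithCache": 230,
}
after_us = {
    "TimedTestExprCompilationNoCache": 14382,
    "TimedTestExprCompilationWithCache": 82.4,
    "TimedTestNonBitcodeExprCompilationNoCache": 1255,
    "TimedTestNonBitcodeExprCompilationWithCache": 90.6,
    "TimedTestLiteralExprCompilationNoCache": 82.1,
    "TimedTestLiteralExprCompilationWithCache": 85.6,
}


def speedup(name: str) -> float:
    """Return how many times faster the 'after' run is for one benchmark."""
    return before_us[name] / after_us[name]


# Print one line per compilation benchmark, e.g. "<name>  <ratio>x".
for name in before_us:
    print(f"{name:<45} {speedup(name):6.2f}x")
```

On these numbers the ratios come out at roughly 1.03x for `TimedTestExprCompilationNoCache`, between 2.6x and 2.8x for the three `WithCache` benchmarks and `TimedTestLiteralExprCompilationNoCache`, and about 10.4x for `TimedTestNonBitcodeExprCompilationNoCache`. These are single runs on one machine, so small differences should be read with that in mind.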
