Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#3).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
......................................................................

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This switches primitive columns to an optimized, unrolled-by-8 predicate
evaluation.

Performance is improved by 1.6-2.5x depending on the particular
predicate, type, and nullability (average around 2x). Branches are
reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, instructions per cycle are way
down (3.31 to 1.28), which indicates we're probably stalled on
instruction dependencies or port saturation. This is also indicated by
the fact that the smaller ints don't seem to run any faster than the
large ints (which wouldn't be the case if we were limited by load/store
bandwidth).

Likely the next fix here is to use SIMD to do comparisons in parallel,
as suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gains, we'll have to
add some hand-written vectorization code. So, we'll start with this
easy win.
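As a rough illustration of the unrolled-by-8 idea described in the commit message, here is a minimal sketch. It is not the code from this patch: the name EvalRangeUnrolled8, its signature, and the LSB-first selection bitmap layout are all assumptions made for the example. It evaluates `lower <= c < upper` eight values at a time, packs the eight results into one byte without branching on the data, and ANDs that byte into the selection bitmap. Null handling is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper for illustration only (not Kudu's actual code).
// Evaluates `lower <= vals[i] < upper` for n values and clears bit i of the
// selection bitmap `sel` for every row that does not match. Bits are assumed
// to be stored LSB-first within each byte.
template <typename T>
void EvalRangeUnrolled8(const T* vals, std::size_t n, T lower, T upper,
                        std::uint8_t* sel) {
  assert(n == 0 || (vals != nullptr && sel != nullptr));
  std::size_t i = 0;
  // Main loop: 8 values per iteration produce exactly one bitmap byte.
  for (; i + 8 <= n; i += 8) {
    unsigned mask = 0;
    // Fixed trip count, so the compiler can fully unroll this inner loop.
    for (unsigned j = 0; j < 8; ++j) {
      const T v = vals[i + j];
      // Bitwise & instead of && avoids a short-circuit branch.
      const unsigned match = static_cast<unsigned>(v >= lower) &
                             static_cast<unsigned>(v < upper);
      mask |= match << j;
    }
    sel[i / 8] &= static_cast<std::uint8_t>(mask);
  }
  // Tail: fewer than 8 values remain; clear single bits, still branch-free.
  for (; i < n; ++i) {
    const unsigned match = static_cast<unsigned>(vals[i] >= lower) &
                           static_cast<unsigned>(vals[i] < upper);
    const unsigned clear_bit = (match ^ 1u) << (i % 8);
    sel[i / 8] &= static_cast<std::uint8_t>(~clear_bit);
  }
}
```

In this sketch the caller is expected to pass a bitmap that already holds the rows selected so far (all ones for a first predicate). For a nullable column one would presumably also AND in a non-null bitmap, but that part is an assumption and is not shown.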
perf-stat before:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028      task-clock (msec)    #    0.997 CPUs utilized
   288,909,311,749      cycles               #    3.515 GHz
   956,410,925,173      instructions         #    3.31  insn per cycle
   149,468,823,714      branches             # 1818.679 M/sec
     1,237,139,955      branch-misses        #    0.83% of all branches

perf-stat after:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916      task-clock (msec)    #    0.996 CPUs utilized
   149,363,412,476      cycles               #    3.504 GHz
   190,514,045,889      instructions         #    1.28  insn per cycle
    19,902,815,659      branches             #  466.917 M/sec
        63,130,874      branch-misses        #    0.32% of all branches

Detailed results before:

  int8    NOT NULL  (c = 0)               573.9M evals/sec    4.78 cycles/eval
  int8    NULL      (c = 0)               456.2M evals/sec    6.14 cycles/eval
  int8    NOT NULL  (c >= 0)              573.5M evals/sec    4.79 cycles/eval
  int8    NULL      (c >= 0)              420.3M evals/sec    6.71 cycles/eval
  int8    NOT NULL  (c >= 0 AND c < 2)    565.1M evals/sec    4.87 cycles/eval
  int8    NULL      (c >= 0 AND c < 2)    372.0M evals/sec    7.53 cycles/eval
  int16   NOT NULL  (c = 0)               577.0M evals/sec    4.75 cycles/eval
  int16   NULL      (c = 0)               460.5M evals/sec    6.06 cycles/eval
  int16   NOT NULL  (c >= 0)              568.9M evals/sec    4.80 cycles/eval
  int16   NULL      (c >= 0)              400.4M evals/sec    6.96 cycles/eval
  int16   NOT NULL  (c >= 0 AND c < 2)    577.9M evals/sec    4.73 cycles/eval
  int16   NULL      (c >= 0 AND c < 2)    299.4M evals/sec    9.40 cycles/eval
  int32   NOT NULL  (c = 0)               543.8M evals/sec    5.05 cycles/eval
  int32   NULL      (c = 0)               446.2M evals/sec    6.21 cycles/eval
  int32   NOT NULL  (c >= 0)              565.5M evals/sec    4.84 cycles/eval
  int32   NULL      (c >= 0)              380.4M evals/sec    7.36 cycles/eval
  int32   NOT NULL  (c >= 0 AND c < 2)    561.8M evals/sec    4.91 cycles/eval
  int32   NULL      (c >= 0 AND c < 2)    308.6M evals/sec    9.18 cycles/eval
  int64   NOT NULL  (c = 0)               566.6M evals/sec    4.88 cycles/eval
  int64   NULL      (c = 0)               463.9M evals/sec    6.07 cycles/eval
  int64   NOT NULL  (c >= 0)              555.5M evals/sec    4.97 cycles/eval
  int64   NULL      (c >= 0)              385.3M evals/sec    7.28 cycles/eval
  int64   NOT NULL  (c >= 0 AND c < 2)    567.1M evals/sec    4.83 cycles/eval
  int64   NULL      (c >= 0 AND c < 2)    194.7M evals/sec   14.61 cycles/eval
  float   NOT NULL  (c = 0)               584.5M evals/sec    4.68 cycles/eval
  float   NULL      (c = 0)               441.4M evals/sec    6.29 cycles/eval
  float   NOT NULL  (c >= 0)              576.6M evals/sec    4.74 cycles/eval
  float   NULL      (c >= 0)              361.1M evals/sec    7.74 cycles/eval
  float   NOT NULL  (c >= 0 AND c < 2)    577.9M evals/sec    4.73 cycles/eval
  float   NULL      (c >= 0 AND c < 2)    301.5M evals/sec    9.34 cycles/eval
  double  NOT NULL  (c = 0)               589.9M evals/sec    4.64 cycles/eval
  double  NULL      (c = 0)               450.0M evals/sec    6.15 cycles/eval
  double  NOT NULL  (c >= 0)              571.5M evals/sec    4.78 cycles/eval
  double  NULL      (c >= 0)              367.8M evals/sec    7.60 cycles/eval
  double  NOT NULL  (c >= 0 AND c < 2)    577.8M evals/sec    4.77 cycles/eval
  double  NULL      (c >= 0 AND c < 2)    429.5M evals/sec    6.49 cycles/eval

Detailed results after:

  int8    NOT NULL  (c = 0)               926.7M evals/sec    3.01 cycles/eval
  int8    NULL      (c = 0)               935.2M evals/sec    2.98 cycles/eval
  int8    NOT NULL  (c >= 0)              913.6M evals/sec    3.03 cycles/eval
  int8    NULL      (c >= 0)              903.2M evals/sec    3.08 cycles/eval
  int8    NOT NULL  (c >= 0 AND c < 2)    824.3M evals/sec    3.35 cycles/eval
  int8    NULL      (c >= 0 AND c < 2)    814.5M evals/sec    3.38 cycles/eval
  int16   NOT NULL  (c = 0)               900.6M evals/sec    3.07 cycles/eval
  int16   NULL      (c = 0)               946.9M evals/sec    2.93 cycles/eval
  int16   NOT NULL  (c >= 0)              925.8M evals/sec    2.99 cycles/eval
  int16   NULL      (c >= 0)              922.6M evals/sec    3.00 cycles/eval
  int16   NOT NULL  (c >= 0 AND c < 2)    819.7M evals/sec    3.35 cycles/eval
  int16   NULL      (c >= 0 AND c < 2)    822.8M evals/sec    3.34 cycles/eval
  int32   NOT NULL  (c = 0)               894.0M evals/sec    3.09 cycles/eval
  int32   NULL      (c = 0)               916.3M evals/sec    3.01 cycles/eval
  int32   NOT NULL  (c >= 0)              916.2M evals/sec    3.02 cycles/eval
  int32   NULL      (c >= 0)              933.2M evals/sec    2.97 cycles/eval
  int32   NOT NULL  (c >= 0 AND c < 2)    863.5M evals/sec    3.17 cycles/eval
  int32   NULL      (c >= 0 AND c < 2)    866.4M evals/sec    3.16 cycles/eval
  int64   NOT NULL  (c = 0)               949.9M evals/sec    2.92 cycles/eval
  int64   NULL      (c = 0)               936.2M evals/sec    2.96 cycles/eval
  int64   NOT NULL  (c >= 0)              950.2M evals/sec    2.92 cycles/eval
  int64   NULL      (c >= 0)              926.0M evals/sec    2.99 cycles/eval
  int64   NOT NULL  (c >= 0 AND c < 2)    835.5M evals/sec    3.29 cycles/eval
  int64   NULL      (c >= 0 AND c < 2)    835.6M evals/sec    3.30 cycles/eval
  float   NOT NULL  (c = 0)               936.5M evals/sec    2.95 cycles/eval
  float   NULL      (c = 0)               933.0M evals/sec    2.97 cycles/eval
  float   NOT NULL  (c >= 0)              852.2M evals/sec    3.27 cycles/eval
  float   NULL      (c >= 0)              838.3M evals/sec    3.32 cycles/eval
  float   NOT NULL  (c >= 0 AND c < 2)    691.9M evals/sec    3.97 cycles/eval
  float   NULL      (c >= 0 AND c < 2)    705.3M evals/sec    3.90 cycles/eval
  double  NOT NULL  (c = 0)               898.3M evals/sec    3.08 cycles/eval
  double  NULL      (c = 0)               879.7M evals/sec    3.14 cycles/eval
  double  NOT NULL  (c >= 0)              800.0M evals/sec    3.46 cycles/eval
  double  NULL      (c >= 0)              836.6M evals/sec    3.32 cycles/eval
  double  NOT NULL  (c >= 0 AND c < 2)    719.2M evals/sec    3.83 cycles/eval
  double  NULL      (c >= 0 AND c < 2)    721.1M evals/sec    3.82 cycles/eval

Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
---
M src/kudu/common/CMakeLists.txt
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/column_predicate.cc
3 files changed, 146 insertions(+), 13 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/91/13591/3
--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <andrew.w...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>