Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Adar Dembo,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/13591
to look at the new patch set (#3).
Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
......................................................................
KUDU-2846 (part 1): optimize predicate evaluation for primitives
This switches predicate evaluation for primitive columns to an
optimized loop that is unrolled by 8.
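
As a rough illustration of what an unrolled-by-8 evaluation can look like, here is a minimal sketch. It is not the code from this patch: the helper name EvalRangeUnrolled8, its signature, and the bitmap layout (bit i % 8 of byte i / 8 selects cell i) are assumptions made for the example. Nullability is assumed to be handled by initializing the selection bitmap from the non-null bitmap before the call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Kudu's actual API): evaluates lower <= cell < upper
// for n cells and ANDs the results into the selection bitmap `sel`, where bit
// (i % 8) of (*sel)[i / 8] corresponds to cell i.
template <typename T>
void EvalRangeUnrolled8(const T* cells, std::size_t n, T lower, T upper,
                        std::vector<std::uint8_t>* sel) {
  assert(sel->size() * 8 >= n);
  std::size_t i = 0;
  // Main loop: 8 cells per iteration produce exactly one bitmap byte.
  for (; i + 8 <= n; i += 8) {
    const T* c = cells + i;
    // Each comparison yields 0 or 1. Combining them with & instead of &&
    // gives the compiler the chance to emit straight-line code with no
    // per-cell branch (not guaranteed, it depends on the compiler).
    auto bit = [&](unsigned k) -> unsigned {
      return (static_cast<unsigned>(c[k] >= lower) &
              static_cast<unsigned>(c[k] < upper)) << k;
    };
    const unsigned bits = bit(0) | bit(1) | bit(2) | bit(3) |
                          bit(4) | bit(5) | bit(6) | bit(7);
    (*sel)[i / 8] &= static_cast<std::uint8_t>(bits);
  }
  // Tail: fewer than 8 cells remain, so clear bits one at a time.
  for (; i < n; ++i) {
    const unsigned shift = static_cast<unsigned>(i % 8);
    const unsigned match = static_cast<unsigned>(cells[i] >= lower) &
                           static_cast<unsigned>(cells[i] < upper);
    // Keep every other bit; keep this bit only if the cell matched.
    (*sel)[i / 8] &= static_cast<std::uint8_t>(~(1u << shift) | (match << shift));
  }
}
```

With this layout, a caller would pass one bitmap byte per 8 cells, and cells that were already deselected (for example NULLs) stay deselected because results are ANDed in.
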
Throughput in the detailed results improves by about 1.2-4.3x depending
on the particular predicate, type, and nullability (average around 2x).
Branches are reduced by about 7.5x and branch-misses by about 19.6x.
Looking at the "after" perf-stat results, instructions per cycle drop
from 3.31 to 1.28, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also suggested by the fact that
the smaller ints don't run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth). The
likely next step is to use SIMD to do comparisons in parallel, as
suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so any further gain will require
hand-written vectorization code. For now, we'll start with this easy
win.
perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':
      82185.366028      task-clock (msec)    #    0.997 CPUs utilized
   288,909,311,749      cycles               #    3.515 GHz
   956,410,925,173      instructions         #    3.31  insn per cycle
   149,468,823,714      branches             # 1818.679 M/sec
     1,237,139,955      branch-misses        #    0.83% of all branches
perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':
      42626.067916      task-clock (msec)    #    0.996 CPUs utilized
   149,363,412,476      cycles               #    3.504 GHz
   190,514,045,889      instructions         #    1.28  insn per cycle
    19,902,815,659      branches             #  466.917 M/sec
        63,130,874      branch-misses        #    0.32% of all branches
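
The reduction factors quoted above can be rechecked from the two perf-stat runs. The following is a small sketch of that arithmetic; the PerfStat struct and the helper names are made up for this example, and the counter values are copied from the output above.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical container for the counters reported by perf-stat above.
struct PerfStat {
  double task_clock_msec;
  double cycles;
  double instructions;
  double branches;
  double branch_misses;
};

// Values copied from the "before" and "after" runs.
const PerfStat kBefore{82185.366028, 288909311749.0, 956410925173.0,
                       149468823714.0, 1237139955.0};
const PerfStat kAfter{42626.067916, 149363412476.0, 190514045889.0,
                      19902815659.0, 63130874.0};

// How many times smaller the "after" value is than the "before" value.
double Ratio(double before, double after) {
  assert(after > 0);
  return before / after;
}

// Instructions per cycle for one run.
double Ipc(const PerfStat& s) {
  assert(s.cycles > 0);
  return s.instructions / s.cycles;
}

void PrintSummary() {
  std::printf("task-clock reduction:    %.2fx\n",
              Ratio(kBefore.task_clock_msec, kAfter.task_clock_msec));
  std::printf("branch reduction:        %.2fx\n",
              Ratio(kBefore.branches, kAfter.branches));
  std::printf("branch-miss reduction:   %.2fx\n",
              Ratio(kBefore.branch_misses, kAfter.branch_misses));
  std::printf("insn per cycle:          %.2f -> %.2f\n",
              Ipc(kBefore), Ipc(kAfter));
}
```

Calling PrintSummary() from a main function prints the overall speedup, the branch and branch-miss reductions, and the instructions-per-cycle figures for both runs.
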
Detailed results before:
int8    NOT NULL  (c = 0)             573.9M evals/sec   4.78 cycles/eval
int8    NULL      (c = 0)             456.2M evals/sec   6.14 cycles/eval
int8    NOT NULL  (c >= 0)            573.5M evals/sec   4.79 cycles/eval
int8    NULL      (c >= 0)            420.3M evals/sec   6.71 cycles/eval
int8    NOT NULL  (c >= 0 AND c < 2)  565.1M evals/sec   4.87 cycles/eval
int8    NULL      (c >= 0 AND c < 2)  372.0M evals/sec   7.53 cycles/eval
int16   NOT NULL  (c = 0)             577.0M evals/sec   4.75 cycles/eval
int16   NULL      (c = 0)             460.5M evals/sec   6.06 cycles/eval
int16   NOT NULL  (c >= 0)            568.9M evals/sec   4.80 cycles/eval
int16   NULL      (c >= 0)            400.4M evals/sec   6.96 cycles/eval
int16   NOT NULL  (c >= 0 AND c < 2)  577.9M evals/sec   4.73 cycles/eval
int16   NULL      (c >= 0 AND c < 2)  299.4M evals/sec   9.40 cycles/eval
int32   NOT NULL  (c = 0)             543.8M evals/sec   5.05 cycles/eval
int32   NULL      (c = 0)             446.2M evals/sec   6.21 cycles/eval
int32   NOT NULL  (c >= 0)            565.5M evals/sec   4.84 cycles/eval
int32   NULL      (c >= 0)            380.4M evals/sec   7.36 cycles/eval
int32   NOT NULL  (c >= 0 AND c < 2)  561.8M evals/sec   4.91 cycles/eval
int32   NULL      (c >= 0 AND c < 2)  308.6M evals/sec   9.18 cycles/eval
int64   NOT NULL  (c = 0)             566.6M evals/sec   4.88 cycles/eval
int64   NULL      (c = 0)             463.9M evals/sec   6.07 cycles/eval
int64   NOT NULL  (c >= 0)            555.5M evals/sec   4.97 cycles/eval
int64   NULL      (c >= 0)            385.3M evals/sec   7.28 cycles/eval
int64   NOT NULL  (c >= 0 AND c < 2)  567.1M evals/sec   4.83 cycles/eval
int64   NULL      (c >= 0 AND c < 2)  194.7M evals/sec  14.61 cycles/eval
float   NOT NULL  (c = 0)             584.5M evals/sec   4.68 cycles/eval
float   NULL      (c = 0)             441.4M evals/sec   6.29 cycles/eval
float   NOT NULL  (c >= 0)            576.6M evals/sec   4.74 cycles/eval
float   NULL      (c >= 0)            361.1M evals/sec   7.74 cycles/eval
float   NOT NULL  (c >= 0 AND c < 2)  577.9M evals/sec   4.73 cycles/eval
float   NULL      (c >= 0 AND c < 2)  301.5M evals/sec   9.34 cycles/eval
double  NOT NULL  (c = 0)             589.9M evals/sec   4.64 cycles/eval
double  NULL      (c = 0)             450.0M evals/sec   6.15 cycles/eval
double  NOT NULL  (c >= 0)            571.5M evals/sec   4.78 cycles/eval
double  NULL      (c >= 0)            367.8M evals/sec   7.60 cycles/eval
double  NOT NULL  (c >= 0 AND c < 2)  577.8M evals/sec   4.77 cycles/eval
double  NULL      (c >= 0 AND c < 2)  429.5M evals/sec   6.49 cycles/eval
Detailed results after:
int8    NOT NULL  (c = 0)             926.7M evals/sec   3.01 cycles/eval
int8    NULL      (c = 0)             935.2M evals/sec   2.98 cycles/eval
int8    NOT NULL  (c >= 0)            913.6M evals/sec   3.03 cycles/eval
int8    NULL      (c >= 0)            903.2M evals/sec   3.08 cycles/eval
int8    NOT NULL  (c >= 0 AND c < 2)  824.3M evals/sec   3.35 cycles/eval
int8    NULL      (c >= 0 AND c < 2)  814.5M evals/sec   3.38 cycles/eval
int16   NOT NULL  (c = 0)             900.6M evals/sec   3.07 cycles/eval
int16   NULL      (c = 0)             946.9M evals/sec   2.93 cycles/eval
int16   NOT NULL  (c >= 0)            925.8M evals/sec   2.99 cycles/eval
int16   NULL      (c >= 0)            922.6M evals/sec   3.00 cycles/eval
int16   NOT NULL  (c >= 0 AND c < 2)  819.7M evals/sec   3.35 cycles/eval
int16   NULL      (c >= 0 AND c < 2)  822.8M evals/sec   3.34 cycles/eval
int32   NOT NULL  (c = 0)             894.0M evals/sec   3.09 cycles/eval
int32   NULL      (c = 0)             916.3M evals/sec   3.01 cycles/eval
int32   NOT NULL  (c >= 0)            916.2M evals/sec   3.02 cycles/eval
int32   NULL      (c >= 0)            933.2M evals/sec   2.97 cycles/eval
int32   NOT NULL  (c >= 0 AND c < 2)  863.5M evals/sec   3.17 cycles/eval
int32   NULL      (c >= 0 AND c < 2)  866.4M evals/sec   3.16 cycles/eval
int64   NOT NULL  (c = 0)             949.9M evals/sec   2.92 cycles/eval
int64   NULL      (c = 0)             936.2M evals/sec   2.96 cycles/eval
int64   NOT NULL  (c >= 0)            950.2M evals/sec   2.92 cycles/eval
int64   NULL      (c >= 0)            926.0M evals/sec   2.99 cycles/eval
int64   NOT NULL  (c >= 0 AND c < 2)  835.5M evals/sec   3.29 cycles/eval
int64   NULL      (c >= 0 AND c < 2)  835.6M evals/sec   3.30 cycles/eval
float   NOT NULL  (c = 0)             936.5M evals/sec   2.95 cycles/eval
float   NULL      (c = 0)             933.0M evals/sec   2.97 cycles/eval
float   NOT NULL  (c >= 0)            852.2M evals/sec   3.27 cycles/eval
float   NULL      (c >= 0)            838.3M evals/sec   3.32 cycles/eval
float   NOT NULL  (c >= 0 AND c < 2)  691.9M evals/sec   3.97 cycles/eval
float   NULL      (c >= 0 AND c < 2)  705.3M evals/sec   3.90 cycles/eval
double  NOT NULL  (c = 0)             898.3M evals/sec   3.08 cycles/eval
double  NULL      (c = 0)             879.7M evals/sec   3.14 cycles/eval
double  NOT NULL  (c >= 0)            800.0M evals/sec   3.46 cycles/eval
double  NULL      (c >= 0)            836.6M evals/sec   3.32 cycles/eval
double  NOT NULL  (c >= 0 AND c < 2)  719.2M evals/sec   3.83 cycles/eval
double  NULL      (c >= 0 AND c < 2)  721.1M evals/sec   3.82 cycles/eval
Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
---
M src/kudu/common/CMakeLists.txt
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/column_predicate.cc
3 files changed, 146 insertions(+), 13 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/91/13591/3
--
To view, visit http://gerrit.cloudera.org:8080/13591
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <[email protected]>