Hello Andrew Wong,
I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/13591
to review the following change.
Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
......................................................................
KUDU-2846 (part 1): optimize predicate evaluation for primitives
This change switches predicate evaluation for primitive columns to an
optimized loop that is unrolled by 8.
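To make "unrolled-by-8" concrete, here is a minimal sketch of the general technique. It is not the code in this patch: the function name `EvalPredicateUnrolled8`, its signature, and the LSB-first selection bitmap layout are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not Kudu's actual code). Evaluates `pred` on `n` values
// and ANDs the results into `sel`, an LSB-first selection bitmap that must
// hold at least (n + 7) / 8 bytes.
template <typename T, typename Pred>
void EvalPredicateUnrolled8(const T* vals, size_t n, uint8_t* sel, Pred pred) {
  size_t i = 0;
  // Main loop: build one result byte from 8 comparisons with no per-row
  // branch, then clear the bits of non-matching rows with a single AND.
  for (; i + 8 <= n; i += 8) {
    const T* v = vals + i;
    const unsigned bits =
        (static_cast<unsigned>(pred(v[0])) << 0) |
        (static_cast<unsigned>(pred(v[1])) << 1) |
        (static_cast<unsigned>(pred(v[2])) << 2) |
        (static_cast<unsigned>(pred(v[3])) << 3) |
        (static_cast<unsigned>(pred(v[4])) << 4) |
        (static_cast<unsigned>(pred(v[5])) << 5) |
        (static_cast<unsigned>(pred(v[6])) << 6) |
        (static_cast<unsigned>(pred(v[7])) << 7);
    sel[i / 8] &= static_cast<uint8_t>(bits);
  }
  // Tail: fewer than 8 rows remain, so handle them one at a time.
  for (; i < n; ++i) {
    const unsigned keep = static_cast<unsigned>(pred(vals[i])) << (i % 8);
    const unsigned mask = ~(1u << (i % 8)) | keep;
    sel[i / 8] &= static_cast<uint8_t>(mask);
  }
}
```

The point of the shape above is that the only branch left in the hot path is the loop condition, taken once per 8 rows, which is consistent with the drop in branches and branch-misses reported in this message.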
Performance improves by roughly 1.2-2.6x depending on the particular
predicate, type, and nullability (average around 2x), with one outlier:
the int16 NULL range predicate improves by about 4.4x. Branches are
reduced by about 7.5x and branch-misses by about 19.6x.
Looking at the "after" perf-stat results, instructions per cycle drop
from 3.31 to 1.28, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also suggested by the fact that
the smaller ints don't run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth). The
likely next step is to use SIMD to do comparisons in parallel, as
suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so further gains will require hand-written
vectorization code. For now, we'll start with this easy win.
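For reference, a hand-vectorized comparison of the kind mentioned above might look roughly like the sketch below. It is not part of this change and is not taken from the JIRA: the helper name `EqMask4` is hypothetical, it covers only int32 equality, and it assumes SSE2 (falling back to a scalar loop elsewhere).

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: compares 4 consecutive int32 values against `target`
// and returns a 4-bit mask where bit k is set iff vals[k] == target.
inline uint8_t EqMask4(const int32_t* vals, int32_t target) {
#if defined(__SSE2__)
  // Unaligned load of 4 values, then one compare instruction for all 4 lanes.
  const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(vals));
  const __m128i t = _mm_set1_epi32(target);
  const __m128i eq = _mm_cmpeq_epi32(v, t);
  // Each matching lane is all-ones; movemask collects the 4 lane sign bits.
  return static_cast<uint8_t>(_mm_movemask_ps(_mm_castsi128_ps(eq)));
#else
  // Scalar fallback for targets without SSE2.
  unsigned m = 0;
  for (int k = 0; k < 4; ++k) {
    m |= static_cast<unsigned>(vals[k] == target) << k;
  }
  return static_cast<uint8_t>(m);
#endif
}
```

Two such masks can be combined into one selection-bitmap byte, and narrower types fit more lanes per register, which is where a SIMD version would be expected to let small ints pull ahead of large ones.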
perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':
      82185.366028      task-clock (msec)    #    0.997 CPUs utilized
   288,909,311,749      cycles               #    3.515 GHz
   956,410,925,173      instructions         #    3.31  insn per cycle
   149,468,823,714      branches             # 1818.679 M/sec
     1,237,139,955      branch-misses        #    0.83% of all branches
      82.398392581 seconds time elapsed
      82.132012000 seconds user
       0.055937000 seconds sys
perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':
      42626.067916      task-clock (msec)    #    0.996 CPUs utilized
   149,363,412,476      cycles               #    3.504 GHz
   190,514,045,889      instructions         #    1.28  insn per cycle
    19,902,815,659      branches             #  466.917 M/sec
        63,130,874      branch-misses        #    0.32% of all branches
Detailed results before:
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NOT
NULL: real 1.730s user 1.730s sys 0.002s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NULL:
real 2.097s user 2.096s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NOT
NULL: real 1.755s user 1.756s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8
NULL: real 2.631s user 2.632s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int8 NOT NULL: real 1.850s user 1.848s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int8 NULL: real 2.808s user 2.808s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NOT
NULL: real 1.753s user 1.752s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16
NULL: real 2.248s user 2.244s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NOT
NULL: real 1.750s user 1.752s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16
NULL: real 2.420s user 2.416s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int16 NOT NULL: real 1.811s user 1.808s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int16 NULL: real 5.321s user 5.313s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NOT
NULL: real 1.834s user 1.824s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32
NULL: real 2.233s user 2.232s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NOT
NULL: real 1.797s user 1.793s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32
NULL: real 2.791s user 2.774s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int32 NOT NULL: real 1.873s user 1.869s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int32 NULL: real 3.104s user 3.071s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NOT
NULL: real 1.781s user 1.779s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64
NULL: real 2.209s user 2.203s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NOT
NULL: real 1.741s user 1.739s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64
NULL: real 2.374s user 2.374s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int64 NOT NULL: real 1.769s user 1.767s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int64 NULL: real 3.113s user 3.099s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NOT
NULL: real 1.766s user 1.765s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float
NULL: real 2.305s user 2.299s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NOT
NULL: real 1.755s user 1.752s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float
NULL: real 2.685s user 2.678s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
float NOT NULL: real 1.777s user 1.771s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
float NULL: real 2.940s user 2.929s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NOT
NULL: real 1.756s user 1.749s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double
NULL: real 2.443s user 2.438s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double
NOT NULL: real 1.819s user 1.819s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double
NULL: real 2.744s user 2.724s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
double NOT NULL: real 1.753s user 1.746s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
double NULL: real 2.481s user 2.460s sys 0.004s
Detailed results after:
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NOT
NULL: real 1.082s user 1.073s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NULL:
real 1.069s user 1.063s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NOT
NULL: real 1.085s user 1.076s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8
NULL: real 1.071s user 1.068s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int8 NOT NULL: real 1.191s user 1.191s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int8 NULL: real 1.209s user 1.206s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NOT
NULL: real 1.099s user 1.099s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16
NULL: real 1.123s user 1.106s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NOT
NULL: real 1.100s user 1.100s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16
NULL: real 1.070s user 1.068s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int16 NOT NULL: real 1.211s user 1.212s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int16 NULL: real 1.220s user 1.220s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NOT
NULL: real 1.104s user 1.104s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32
NULL: real 1.105s user 1.104s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NOT
NULL: real 1.107s user 1.108s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32
NULL: real 1.081s user 1.080s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int32 NOT NULL: real 1.230s user 1.228s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int32 NULL: real 1.219s user 1.220s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NOT
NULL: real 1.071s user 1.072s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64
NULL: real 1.090s user 1.088s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NOT
NULL: real 1.069s user 1.067s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64
NULL: real 1.083s user 1.084s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int64 NOT NULL: real 1.253s user 1.252s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
int64 NULL: real 1.248s user 1.248s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NOT
NULL: real 1.144s user 1.144s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float
NULL: real 1.144s user 1.144s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NOT
NULL: real 1.159s user 1.160s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float
NULL: real 1.214s user 1.216s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
float NOT NULL: real 1.439s user 1.436s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
float NULL: real 1.457s user 1.458s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NOT
NULL: real 1.196s user 1.195s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double
NULL: real 1.213s user 1.212s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double
NOT NULL: real 1.232s user 1.230s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double
NULL: real 1.256s user 1.241s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
double NOT NULL: real 1.419s user 1.418s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type
double NULL: real 1.430s user 1.426s sys 0.000s
Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
---
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/column_predicate.cc
2 files changed, 135 insertions(+), 12 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/91/13591/1
--
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>