Hello Tidy Bot, Andrew Wong, Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/15802
to look at the new patch set (#2).
Change subject: KUDU-2844 (3/3): avoid copying plain/dict strings to RowBlock Arena
......................................................................
KUDU-2844 (3/3): avoid copying plain/dict strings to RowBlock Arena
This changes Dictionary and Plain binary blocks to no longer copy string
values into the destination RowBlock's Arena. Instead, the dictionary
block (or plain block) is attached to the RowBlockMemory's list of
reference-counted block handles, and the cell directly refers to the
underlying block handle.
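As an illustration of the idea (a minimal Python sketch, not the actual Kudu C++ code): rather than memcpy-ing each string cell into a per-scan arena, the RowBlock's memory retains a handle to the decoded block, and each cell is just a view into that shared buffer.

```python
# Sketch of the zero-copy scheme. Names here (retained_blocks, cell) are
# illustrative stand-ins, not Kudu identifiers.
block = bytearray(b"helloworld")   # stand-in for a decoded plain/dict block
retained_blocks = [block]          # RowBlockMemory's list of retained handles
cell = memoryview(block)[0:5]      # cell refers directly into the block: no copy

assert bytes(cell) == b"hello"
```

The retained handle guarantees the backing block outlives every cell that points into it, which is what makes dropping the copy safe.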
I performance-tested this by loading a lineitem table using
tpch_real_world:
$ tpch_real_world -tpch-path-to-dbgen-dir /data/2/todd/dbgen/ \
    -tpch_num_inserters=8 -tpch_scaling_factor=100 \
    --tpch_mini_cluster_base_dir /data/1/todd/tpch-kudu-data
For the numbers reported below, I accidentally cancelled the load after
347M rows were present, but I used the same dataset for both "before"
and "after" so the relative comparison is still valid.
I started a tserver and master with the same data directories, with
the tserver running inside perf stat (or perf record to look at
profile):
$ kudu-master \
-fs-wal-dir /data/1/todd/tpch-kudu-data/master-0/wal/ \
-fs-data-dirs /data/1/todd/tpch-kudu-data/master-0/data/
$ perf stat kudu-tserver \
-fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ \
-fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/
I waited until the data had been fully flushed from MRS before running
the read workloads. To test the reads I used:
$ kudu perf table_scan localhost tpch_real_world \
--columns l_shipdate --num_threads=16
The results of the first test were a bit noisy due to NUMA placement
issues -- some runs were 30-40% faster than other runs, even on the same
build, which made it hard to compare results, even though it was clear
that the optimized version used fewer cycles on average. So, I ran both
the tserver and the client using 'numactl -m 0 -N 0' to force everything
to a single NUMA node. This made results much more consistent.
Before:
Performance counter stats for 'numactl -m 0 -N 0 build/latest/bin/kudu-tserver -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/':

        109,499.82 msec task-clock        #    2.157 CPUs utilized
           126,972      context-switches  #    0.001 M/sec
               860      cpu-migrations    #    0.008 K/sec
           179,152      page-faults       #    0.002 M/sec
   287,368,365,068      cycles            #    2.624 GHz                  (28.49%)
   449,376,563,201      instructions      #    1.56  insn per cycle       (43.78%)
    72,675,361,468      branches          #  663.703 M/sec                (27.51%)
     1,498,319,620      branch-misses     #    2.06% of all branches      (22.68%)

      88.573074000 seconds user
      30.057621000 seconds sys
After:
Performance counter stats for 'numactl -m 0 -N 0 build/latest/bin/kudu-tserver.threadsafe -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/':

         79,082.33 msec task-clock        #    2.136 CPUs utilized
           123,454      context-switches  #    0.002 M/sec
               651      cpu-migrations    #    0.008 K/sec
           180,599      page-faults       #    0.002 M/sec
   206,372,306,012      cycles            #    2.610 GHz                  (28.27%)
   333,399,478,320      instructions      #    1.62  insn per cycle       (43.61%)
    45,056,751,917      branches          #  569.745 M/sec                (27.44%)
       223,361,581      branch-misses     #    0.50% of all branches      (22.77%)

      57.790049000 seconds user
      29.709481000 seconds sys
To summarize, the change gives about a 1.39x reduction in CPU cycles on the
tserver, and the scan tool reported about a 20% reduction in wall-clock time.
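For reference, the headline 1.39x figure can be recomputed directly from the cycle counters in the perf stat output above:

```python
# Cycle counts copied from the "Before" and "After" perf stat runs.
cycles_before = 287_368_365_068
cycles_after = 206_372_306_012

ratio = cycles_before / cycles_after
print(f"cycle reduction: {ratio:.2f}x")  # prints "cycle reduction: 1.39x"
```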
Looking at 'perf diff -c ratio -o 0 --sort=sym' between the two profiles
collected by perf record, we can see that the BinaryDictBlockDecoder code path
is much cheaper and most of the memcpy calls are gone, while slightly more CPU
is spent serializing the result (probably due to reduced locality of reference
when copying the string data to the wire).
Orig %   CPU Ratio   Symbol
-------------------------------------------------------------------------------
26.16%   0.437272    [.] kudu::cfile::BinaryDictBlockDecoder::CopyNextDecodeStrings(unsigned long*, kudu::ColumnDataView*)
19.63%   1.135304    [.] kudu::SerializeRowBlock(kudu::RowBlock const&, kudu::Schema const*, kudu::faststring*, kudu::faststring*, bool)
11.49%   1.002845    [k] copy_user_enhanced_fast_string
10.29%   0.068955    [.] __memcpy_ssse3_back
Change-Id: I93fa1f9fd401814a42dc5a1f3fd2ffb1286ac441
---
M src/kudu/cfile/binary_dict_block.cc
M src/kudu/cfile/binary_plain_block.cc
M src/kudu/cfile/binary_plain_block.h
M src/kudu/cfile/block_handle.h
M src/kudu/cfile/cfile_reader.cc
M src/kudu/common/rowblock_memory.h
6 files changed, 60 insertions(+), 17 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/15802/2
--
To view, visit http://gerrit.cloudera.org:8080/15802
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93fa1f9fd401814a42dc5a1f3fd2ffb1286ac441
Gerrit-Change-Number: 15802
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)