Hello Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/6569
to look at the new patch set (#3).
Change subject: WIP: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................
WIP: KUDU-680. bloomfile: switch to a threadlocal cache
Prior to this patch, every BloomFileReader instance keeps a per-CPU
vector of IndexTreeIterators to try to avoid construction/destruction
costs. However, this has significant overhead. Rough math:
1TB of data / (32MB/rowset) = 32k rowsets
32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per IndexTreeIterator)
  = approximately 246MB of RAM
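The arithmetic above can be checked mechanically. The snippet below is only a sanity check of the estimate; the 100-byte IndexTreeIterator figure is the rough estimate from this message, not a measured size:

```cpp
#include <cassert>
#include <cstdint>

// 1TB of data at 32MB per rowset.
constexpr int64_t kRowsets = (1LL << 40) / (32LL << 20);  // = 32768
constexpr int64_t kCores = 48;
// One 64-byte padded_spinlock plus a ~100-byte IndexTreeIterator per entry.
constexpr int64_t kBytesPerEntry = 64 + 100;

constexpr int64_t EstimatedOverheadBytes() {
  return kRowsets * kCores * kBytesPerEntry;
}

// 257,949,696 bytes, which is exactly 246 MiB.
static_assert(EstimatedOverheadBytes() == 257949696, "estimate changed");
static_assert((EstimatedOverheadBytes() >> 20) == 246, "not 246 MiB");
```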
This doesn't even include the BlockCache entries that end up pinned by
these readers for cold data that was last read long ago, which is
probably the root cause of KUDU-680.
This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (currently 4-entry) per-thread cache for use cases
like this. With the new batched Apply() path for writes, consecutive
operations are likely to hit the same rowset repeatedly, which means
we're likely to get a good hit rate even with such a small cache.
In addition to being a more memory-efficient way of avoiding
IndexTreeIterator costs, this patch also avoids a trip to the central
BlockCache when consecutive bloom queries fall into the exact same
BloomFilter within the BloomFile, by keeping the most recently used
block cached in the same structure.
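To make the idea concrete, here is a minimal sketch of a fixed-size per-thread cache in the spirit of the one described above. This is an illustration only: the class name, the 4-entry capacity, and the FIFO eviction policy follow the description in this message, but the actual API in src/kudu/util/threadlocal_cache.h may differ.

```cpp
#include <array>
#include <cassert>
#include <utility>

// Sketch: a tiny fixed-capacity cache, one instance per thread.
// Lookups scan the (at most 4) entries linearly; when the cache is
// full, Insert evicts entries in FIFO order.
template <typename Key, typename Value, int kCapacity = 4>
class ThreadLocalCache {
 public:
  // Returns a pointer to the cached value for 'key', or nullptr on a miss.
  Value* Lookup(const Key& key) {
    for (int i = 0; i < size_; i++) {
      if (entries_[i].first == key) return &entries_[i].second;
    }
    return nullptr;
  }

  // Inserts 'value', evicting the oldest entry if the cache is full.
  Value* Insert(const Key& key, Value value) {
    int slot;
    if (size_ < kCapacity) {
      slot = size_++;
    } else {
      slot = next_victim_;
      next_victim_ = (next_victim_ + 1) % kCapacity;
    }
    entries_[slot] = {key, std::move(value)};
    return &entries_[slot].second;
  }

  // One instance per thread, created lazily on first access.
  static ThreadLocalCache* GetInstance() {
    thread_local ThreadLocalCache cache;
    return &cache;
  }

 private:
  std::array<std::pair<Key, Value>, kCapacity> entries_{};
  int size_ = 0;
  int next_victim_ = 0;
};
```

Because each thread owns its own instance, Lookup() and Insert() need no locking at all, and the 4-entry linear scan touches only a couple of cache lines.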
WIP: should run YCSB as a second end-to-end benchmark
As a microbenchmark, I used mt-bloomfile-test:
Before:
Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

   16632.878265 task-clock              # 7.212 CPUs utilized        ( +-  0.68% )
        318,630 context-switches        # 0.019 M/sec               ( +-  1.70% )
             43 cpu-migrations          # 0.003 K/sec               ( +- 15.50% )
          1,563 page-faults             # 0.094 K/sec               ( +-  0.06% )
 47,956,020,453 cycles                  # 2.883 GHz                 ( +-  0.66% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
 17,189,022,099 instructions            # 0.36 insns per cycle      ( +-  0.26% )
  3,691,965,566 branches                # 221.968 M/sec             ( +-  0.29% )
     31,842,288 branch-misses           # 0.86% of all branches     ( +-  0.33% )

    2.306319236 seconds time elapsed                                ( +-  0.59% )
After:
Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

   11573.863720 task-clock              # 7.167 CPUs utilized        ( +-  0.91% )
        213,283 context-switches        # 0.018 M/sec               ( +-  2.31% )
             29 cpu-migrations          # 0.003 K/sec               ( +- 10.75% )
          1,566 page-faults             # 0.135 K/sec               ( +-  0.06% )
 33,399,534,315 cycles                  # 2.886 GHz                 ( +-  0.90% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
 13,043,928,426 instructions            # 0.39 insns per cycle      ( +-  0.45% )
  2,722,013,296 branches                # 235.186 M/sec             ( +-  0.58% )
     27,412,912 branch-misses           # 1.01% of all branches     ( +-  0.79% )

    1.614814095 seconds time elapsed                                ( +-  0.86% )
(30% fewer cycles, a ~1.44x speedup)
As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times
These are various works-in-progress that will show up on gerrit in
coming days. Without these patches, I found that the performance was
fairly comparable because the writer was throttled so heavily by memory
limiting that very few RowSets accumulated and bloom lookups were not a
bottleneck.
The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
reduction in block cache lookups per second)
Various graphs from the benchmark runs are posted here:
https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing
Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/index_btree.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/CMakeLists.txt
M src/kudu/util/bloom_filter.h
M src/kudu/util/mt-threadlocal-test.cc
A src/kudu/util/threadlocal_cache.cc
A src/kudu/util/threadlocal_cache.h
10 files changed, 265 insertions(+), 86 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/6569/3
--
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot