Hello Adar Dembo, Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/6569
to look at the new patch set (#5).
Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................
KUDU-680. bloomfile: switch to a threadlocal cache
Prior to this patch, every BloomFileReader instance kept a per-CPU
vector of IndexTreeIterators to avoid construction/destruction
costs. However, this approach has significant memory overhead. Rough math:
1TB of data / (32MB/rowset) = 32k rowsets
32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
IndexTreeIterator)
= approximately 246MB of RAM
This doesn't even include the BlockCache entries which end up pinned by
these readers in cold data that was last read long ago, an issue which
is probably the root cause of KUDU-680.
This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (4-entry, currently) per-thread object cache for
use cases like this. With the new batched Apply() path for writes, we
can expect each subsequent operation to hit the same rowset
repeatedly, so we're likely to get a good hit rate even with such a
small cache.
In addition to being a more memory-efficient way of avoiding
IndexTreeIterator construction costs, this patch also avoids a trip to
the central BlockCache when consecutive bloom queries fall into the
very same BloomFilter block within the BloomFile, by keeping the most
recently read block cached in the same structure.
As a microbenchmark, I used mt-bloomfile-test:
Before:
Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

    16632.878265 task-clock            #   7.212 CPUs utilized          ( +-  0.68% )
         318,630 context-switches      #   0.019 M/sec                  ( +-  1.70% )
              43 cpu-migrations        #   0.003 K/sec                  ( +- 15.50% )
           1,563 page-faults           #   0.094 K/sec                  ( +-  0.06% )
  47,956,020,453 cycles                #   2.883 GHz                    ( +-  0.66% )
 <not supported> stalled-cycles-frontend
 <not supported> stalled-cycles-backend
  17,189,022,099 instructions          #   0.36  insns per cycle        ( +-  0.26% )
   3,691,965,566 branches              # 221.968 M/sec                  ( +-  0.29% )
      31,842,288 branch-misses         #   0.86% of all branches        ( +-  0.33% )

     2.306319236 seconds time elapsed                                   ( +-  0.59% )
After:
Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

    11314.801368 task-clock            #   7.230 CPUs utilized          ( +-  0.74% )
         213,126 context-switches      #   0.019 M/sec                  ( +-  2.04% )
              27 cpu-migrations        #   0.002 K/sec                  ( +- 13.33% )
           1,547 page-faults           #   0.137 K/sec                  ( +-  0.05% )
  32,714,275,234 cycles                #   2.891 GHz                    ( +-  0.72% )
 <not supported> stalled-cycles-frontend
 <not supported> stalled-cycles-backend
  13,071,571,872 instructions          #   0.40  insns per cycle        ( +-  0.35% )
   2,719,869,660 branches              # 240.382 M/sec                  ( +-  0.49% )
      27,950,304 branch-misses         #   1.03% of all branches        ( +-  0.39% )
(The "before" run consumed ~46% more cycles than "after".)
As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
  only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times
These are various works-in-progress that will show up on gerrit in
the coming days. Without them, I found the performance to be fairly
comparable: the writer was throttled so heavily by memory limiting
that very few RowSets accumulated, and bloom lookups were not a
bottleneck.
The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
reduction in block cache lookups per second)
As another end-to-end benchmark, I used YCSB, also using the same local
patches mentioned above so that it wasn't memory-throttling-bound. In
the YCSB "load" phase, the insertions are fully uniform-random, so we
don't expect to see performance benefits, but we do expect memory
usage to stay more consistent and to see no serious perf regression.
In practice, the results were as predicted; details are in the
document linked below.
Various graphs from the benchmark runs are posted here:
https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing
Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/index_btree.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/bloom_filter.h
M src/kudu/util/mt-threadlocal-test.cc
A src/kudu/util/threadlocal_cache.h
8 files changed, 231 insertions(+), 84 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/6569/5
--
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>