Hello Adar Dembo, Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/6569
to look at the new patch set (#5).
Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................
KUDU-680. bloomfile: switch to a threadlocal cache
Prior to this patch, every BloomFileReader instance kept a per-CPU
vector of IndexTreeIterators to avoid construction/destruction
costs. However, this approach has significant memory overhead. Rough math:
1TB of data / (32MB/rowset) = 32k rowsets
32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
IndexTreeIterator)
= approximately 246MB of RAM
This doesn't even include the BlockCache entries which end up pinned by
these readers in cold data that was last read long ago, an issue which
is probably the root cause of KUDU-680.
This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (4-entry, currently) per-thread object cache for
use cases like this. With the new batched Apply() path for writes, we
can expect each subsequent operation to hit the same rowset
repeatedly, so we're likely to get a good hit rate even with such a
small cache.
In addition to being a more memory-efficient way of avoiding
IndexTreeIterator construction costs, this patch also avoids a trip to
the central BlockCache when consecutive bloom queries fall into the
very same BloomFilter block within the BloomFile, by keeping the most
recently read block cached in the same structure.
As a microbenchmark, I used mt-bloomfile-test:
Before:
Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

    16632.878265 task-clock            #   7.212 CPUs utilized          ( +-  0.68% )
         318,630 context-switches      #   0.019 M/sec                  ( +-  1.70% )
              43 cpu-migrations        #   0.003 K/sec                  ( +- 15.50% )
           1,563 page-faults           #   0.094 K/sec                  ( +-  0.06% )
  47,956,020,453 cycles                #   2.883 GHz                    ( +-  0.66% )
 <not supported> stalled-cycles-frontend
 <not supported> stalled-cycles-backend
  17,189,022,099 instructions          #   0.36  insns per cycle        ( +-  0.26% )
   3,691,965,566 branches              # 221.968 M/sec                  ( +-  0.29% )
      31,842,288 branch-misses         #   0.86% of all branches        ( +-  0.33% )

     2.306319236 seconds time elapsed                                   ( +-  0.59% )
After:
Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

    11314.801368 task-clock            #   7.230 CPUs utilized          ( +-  0.74% )
         213,126 context-switches      #   0.019 M/sec                  ( +-  2.04% )
              27 cpu-migrations        #   0.002 K/sec                  ( +- 13.33% )
           1,547 page-faults           #   0.137 K/sec                  ( +-  0.05% )
  32,714,275,234 cycles                #   2.891 GHz                    ( +-  0.72% )
 <not supported> stalled-cycles-frontend
 <not supported> stalled-cycles-backend
  13,071,571,872 instructions          #   0.40  insns per cycle        ( +-  0.35% )
   2,719,869,660 branches              # 240.382 M/sec                  ( +-  0.49% )
      27,950,304 branch-misses         #   1.03% of all branches        ( +-  0.39% )
(The "before" run consumed ~46% more cycles than "after".)
As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
  only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times
These are various works-in-progress that will show up on gerrit in
the coming days. Without them, I found the performance to be fairly
comparable: the writer was throttled so heavily by memory limiting
that very few RowSets accumulated, and bloom lookups were not a
bottleneck.
The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
reduction in block cache lookups per second)
As another end-to-end benchmark, I used YCSB, also using the same local
patches mentioned above so that it wasn't memory-throttling-bound. In
the YCSB "load" phase, the insertions are fully uniform-random, so we
don't expect to see performance benefits, but we do expect memory
usage to stay more consistent and to see no serious perf regression.
In practice, the results were as predicted; details are in the
document linked below.
Various graphs from the benchmark runs are posted here:
https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing
Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/index_btree.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/bloom_filter.h
M src/kudu/util/mt-threadlocal-test.cc
A src/kudu/util/threadlocal_cache.h
8 files changed, 231 insertions(+), 84 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/6569/5
--
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>