Hello David Ribeiro Alves,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/6569

to review the following change.

Change subject: WIP: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................

WIP: KUDU-680. bloomfile: switch to a threadlocal cache

Prior to this patch, every BloomFileReader instance keeps a per-CPU
vector of IndexTreeIterators to try to avoid construction/destruction
costs. However, this approach has significant memory overhead. Rough math:

1TB of data / (32MB/rowset) = 32k rowsets

32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
                          IndexTreeIterator)
   = approximately 246MB of RAM

This doesn't even include the BlockCache entries which end up pinned by
these readers in cold data that was last read long ago, an issue which
is probably the root cause of KUDU-680.

This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (4-entry, currently) per-thread cache for use cases
like this. With the new batched Apply() path for writes, we can expect
that each subsequent operation is likely to hit the same rowset over and
over in a row, which means that we're likely to get a good hit rate even
with such a small cache.
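For illustration, a minimal sketch of what such a per-thread cache utility might look like. All names, the fixed 4-entry capacity as a template default, and the round-robin eviction policy here are hypothetical, not the actual Kudu implementation:

```cpp
#include <array>
#include <cstdint>
#include <utility>

// Hypothetical sketch: a tiny fixed-capacity cache, one instance per
// thread, keyed by e.g. a rowset/bloomfile identifier. Illustrative
// only; the real Kudu ThreadLocalCache differs.
template <typename Key, typename Value, int kCapacity = 4>
class ThreadLocalCache {
 public:
  // Returns a pointer to the cached value for 'key', or nullptr on miss.
  // With only a handful of entries, a linear scan beats any hashing.
  Value* Lookup(const Key& key) {
    for (auto& e : entries_) {
      if (e.valid && e.key == key) return &e.value;
    }
    return nullptr;
  }

  // Inserts an entry, evicting in simple round-robin order.
  Value* Insert(const Key& key, Value value) {
    Entry& e = entries_[next_victim_++ % kCapacity];
    e.key = key;
    e.value = std::move(value);
    e.valid = true;
    return &e.value;
  }

  // One instance per thread via a function-scope thread_local static.
  static ThreadLocalCache* GetInstance() {
    static thread_local ThreadLocalCache cache;
    return &cache;
  }

 private:
  struct Entry {
    Key key{};
    Value value{};
    bool valid = false;
  };
  std::array<Entry, kCapacity> entries_;
  uint32_t next_victim_ = 0;
};
```

When consecutive operations tend to hit the same key, even this trivial replacement policy yields a good hit rate, and the per-thread footprint is bounded by kCapacity entries rather than scaling with rowsets times cores.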

In addition to being a more memory-efficient way of avoiding
IndexTreeIterator costs, this patch also avoids a trip to the central
BlockCache in the case that subsequent bloom queries fall into the same
exact BloomFilter within the BloomFile by keeping the most recent block
cached in the same structure.
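As a sketch of that idea (names hypothetical, not the actual Kudu types): the per-thread entry can remember the offset and contents of the most recently read bloom block, so a repeat probe that lands in the same block is answered locally and never touches the shared BlockCache:

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Illustrative only: what a cached entry might hold so that repeated
// probes of the same bloom block skip the central BlockCache.
struct BloomCacheEntry {       // hypothetical name
  int64_t block_offset = -1;   // offset of the most recently read block
  std::string block_data;      // that block's contents (or a handle)

  // Returns the cached block if the probe lands in the same block,
  // otherwise nullptr, signalling that a BlockCache lookup is needed.
  const std::string* GetIfCached(int64_t offset) const {
    return offset == block_offset ? &block_data : nullptr;
  }

  // Remembers the block just fetched for subsequent probes.
  void Fill(int64_t offset, std::string data) {
    block_offset = offset;
    block_data = std::move(data);
  }
};
```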

WIP: should run YCSB as a second end-to-end benchmark
WIP: need to add docs, etc, to code

As a microbenchmark, I used mt-bloomfile-test:
  Before:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        16632.878265 task-clock                #    7.212 CPUs utilized          ( +-  0.68% )
             318,630 context-switches          #    0.019 M/sec                  ( +-  1.70% )
                  43 cpu-migrations            #    0.003 K/sec                  ( +- 15.50% )
               1,563 page-faults               #    0.094 K/sec                  ( +-  0.06% )
      47,956,020,453 cycles                    #    2.883 GHz                    ( +-  0.66% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      17,189,022,099 instructions              #    0.36  insns per cycle        ( +-  0.26% )
       3,691,965,566 branches                  #  221.968 M/sec                  ( +-  0.29% )
          31,842,288 branch-misses             #    0.86% of all branches        ( +-  0.33% )

         2.306319236 seconds time elapsed                                        ( +-  0.59% )

  After:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        11573.863720 task-clock                #    7.167 CPUs utilized          ( +-  0.91% )
             213,283 context-switches          #    0.018 M/sec                  ( +-  2.31% )
                  29 cpu-migrations            #    0.003 K/sec                  ( +- 10.75% )
               1,566 page-faults               #    0.135 K/sec                  ( +-  0.06% )
      33,399,534,315 cycles                    #    2.886 GHz                    ( +-  0.90% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      13,043,928,426 instructions              #    0.39  insns per cycle        ( +-  0.45% )
       2,722,013,296 branches                  #  235.186 M/sec                  ( +-  0.58% )
          27,412,912 branch-misses             #    1.01% of all branches        ( +-  0.79% )

         1.614814095 seconds time elapsed                                        ( +-  0.86% )

  (~30% fewer cycles: 47.96B before vs. 33.40B after, a 1.44x reduction)

As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- a local patch to enable hash-partitioned inserts, so that the workload
  is not a purely sequential write and there is contention on the same
  tablets
- the maintenance manager starts flushing at 60% of the memory limit,
  but writes are only throttled at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
  there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times

These are various works in progress that will show up on gerrit in the
coming days. Without these patches, I found that the performance was
fairly comparable because the writer was throttled so heavily by memory
limiting that very few RowSets accumulated and bloom lookups were not a
bottleneck.

The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (a 1.21x speedup)
- CPU time reduced from 77895s to 57399s (1.36x less CPU)
- Block cache usage stayed under the configured limit with the patch,
  but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
  reduction in block cache lookups per second)

Various graphs from the benchmark runs are posted here:
  https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing

Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/CMakeLists.txt
M src/kudu/util/bloom_filter.h
A src/kudu/util/threadlocal_cache.cc
A src/kudu/util/threadlocal_cache.h
8 files changed, 171 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/6569/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
