Hello Dan Burkert,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/9108

to review the following change.


Change subject: Upgrade to tcmalloc 2.6.3 and enable sized deallocation
......................................................................

Upgrade to tcmalloc 2.6.3 and enable sized deallocation

This bumps to a new gperftools/tcmalloc version which ought to be a bit
faster. It also enables sized-deallocation support on compilers that
support it.

The upgrade is relatively straightforward except for a few items:
- our own base::SpinLock from gutil conflicts with tcmalloc's. This
  conflict has always existed, but tcmalloc's implementation apparently
  changed size slightly, so the linker now complains about the duplicate
  symbol definition. I created a simple find-and-replace patch that moves
  tcmalloc's copy into a 'tcmalloc::' namespace instead of 'base::'.
- our previous patches have all been merged into tcmalloc upstream and
  are no longer necessary.

Sized deallocation is a C++14 feature (which can also be selectively
enabled in C++11 mode) that adds a new 'operator delete(void*, size_t)'.
In the common case where we delete a Foo*, the compiler already knows
sizeof(Foo) and can pass it as the second argument to 'operator delete'.
This takes an optimized code path inside tcmalloc -- namely, it can skip
the expensive lookup which otherwise would be required to map the
pointer back to its size class.

Unfortunately, we can't always rely on the system allocator to provide
sized deallocation, so we can't use it in the exported variants of our
code (i.e. the client library). This patch therefore enables it only
when building the Kudu binaries, not the client library.
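A minimal sketch of what the conditional enabling might look like in
CMake -- the KUDU_CLIENT_BUILD guard and the exact wiring are
assumptions for illustration, not the actual change (the flag name is
the real GCC/Clang one):

```cmake
include(CheckCXXCompilerFlag)
# Probe whether the compiler supports the sized-deallocation flag.
check_cxx_compiler_flag("-fsized-deallocation" COMPILER_SUPPORTS_SIZED_DEALLOCATION)
# Only enable it for the server binaries, never for the exported
# client library, since the system allocator there may lack support.
if(COMPILER_SUPPORTS_SIZED_DEALLOCATION AND NOT KUDU_CLIENT_BUILD)
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsized-deallocation")
endif()
```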

I ran a quick performance check using full_stack-insert-scan-test, run
twice each before and after the change:

Before:
 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

     322623.135418      task-clock (msec)         #    4.964 CPUs utilized
        11,751,640      context-switches          #    0.036 M/sec
         3,619,010      cpu-migrations            #    0.011 M/sec
           117,706      page-faults               #    0.365 K/sec
 1,057,656,296,542      cycles                    #    3.278 GHz
   695,668,728,662      instructions              #    0.66  insn per cycle
   127,565,245,327      branches                  #  395.400 M/sec
     1,480,987,998      branch-misses             #    1.16% of all branches

      64.996506196 seconds time elapsed

     330206.351604      task-clock (msec)         #    5.330 CPUs utilized
        12,518,370      context-switches          #    0.038 M/sec
         3,695,790      cpu-migrations            #    0.011 M/sec
           113,580      page-faults               #    0.344 K/sec
 1,059,061,436,907      cycles                    #    3.207 GHz
   697,782,051,928      instructions              #    0.66  insn per cycle
   127,949,774,596      branches                  #  387.484 M/sec
     1,371,967,917      branch-misses             #    1.07% of all branches

      61.952217532 seconds time elapsed

After:

     310788.334202      task-clock (msec)         #    5.315 CPUs utilized
        12,486,276      context-switches          #    0.040 M/sec
         3,690,660      cpu-migrations            #    0.012 M/sec
           113,988      page-faults               #    0.367 K/sec
 1,027,033,932,375      cycles                    #    3.305 GHz
   663,508,168,985      instructions              #    0.65  insn per cycle
   125,864,966,052      branches                  #  404.986 M/sec
     1,442,683,495      branch-misses             #    1.15% of all branches

      58.479033228 seconds time elapsed

 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test.263 --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

     317505.548908      task-clock (msec)         #    5.323 CPUs utilized
        12,461,003      context-switches          #    0.039 M/sec
         3,688,491      cpu-migrations            #    0.012 M/sec
           113,952      page-faults               #    0.359 K/sec
 1,029,595,155,391      cycles                    #    3.243 GHz
   663,514,269,656      instructions              #    0.64  insn per cycle
   125,865,138,975      branches                  #  396.419 M/sec
     1,419,717,877      branch-misses             #    1.13% of all branches

      59.642651558 seconds time elapsed

Looks like a small but consistent improvement: roughly 3% fewer cycles,
about 5% fewer instructions, and a few seconds less wall-clock time.

Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
---
M CMakeLists.txt
M thirdparty/download-thirdparty.sh
D thirdparty/patches/gperftools-Change-default-TCMALLOC_TRANSFER_NUM_OBJ-to-40.patch
A thirdparty/patches/gperftools-Replace-namespace-base-with-namespace-tcmalloc.patch
D thirdparty/patches/gperftools-hook-mi_force_unlock-on-OSX-instead-of-pthread_atfork.patch
D thirdparty/patches/gperftools-issue-827-add_get_default_zone_to_osx_libc_override.patch
M thirdparty/vars.sh
7 files changed, 1,696 insertions(+), 159 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/9108/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
