Todd Lipcon has submitted this change and it was merged. (
http://gerrit.cloudera.org:8080/9108 )
Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
......................................................................
Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
This bumps to a new gperftools/tcmalloc version which ought to be a bit
faster. It also enables sized-deallocation support on compilers that
support it.
The upgrade is relatively straigthforward but for a few items:
- for whatever reason, we developed a conflict between our own
base::SpinLock from gutil and tcmalloc's. This conflict has always
existed, but apparently tcmalloc's implementation changed sizes
slightly, meaning that now the linker complains about the duplicate
symbol definition. I created a simple find-replace patch to change
tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'.
- our previous patches are all now included in tcmalloc upstream and no
longer necessary.
- for unknown reasons, gperftools now fails to build with TSAN
instrumentation: it shows duplicate symbols between the tcmalloc
'operator new' and clang_rt's 'operator new'. We never actually needed
to use gperftools in TSAN builds anyway, so this just removes the
instrumented build and properly ifdefs out a few of the places where
we'd included gperftools headers unconditionally.
Sized deallocation is a C++14 feature (which can be selectively enabled
on C++11) which adds a new 'operator delete(void*, size_t)'. In the
common case where we delete a Foo*, the compiler already knows
sizeof(Foo) and thus can pass this second argument to 'operator delete'.
This goes through an optimized code path inside tcmalloc -- namely, it
can skip the expensive lookup which otherwise would be required to map
the pointer back to its size class.
Unfortuantely, we can't always rely on the system allocator to provide
sized deallocation, and thus we can't use this in the exported variants
of our code (i.e. the client library). So, this patch only enables it
when building the Kudu binaries and not the client library.
I ran a quick performance check using full_stack-insert-scan-test twice
each before and after:
Before:
Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test
--gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000
--concurrent_inserts=10 --rows_per_batch=1 --skip_scans':
322623.135418 task-clock (msec) # 4.964 CPUs utilized
11,751,640 context-switches # 0.036 M/sec
3,619,010 cpu-migrations # 0.011 M/sec
117,706 page-faults # 0.365 K/sec
1,057,656,296,542 cycles # 3.278 GHz
695,668,728,662 instructions # 0.66 insn per cycle
127,565,245,327 branches # 395.400 M/sec
1,480,987,998 branch-misses # 1.16% of all branches
64.996506196 seconds time elapsed
330206.351604 task-clock (msec) # 5.330 CPUs utilized
12,518,370 context-switches # 0.038 M/sec
3,695,790 cpu-migrations # 0.011 M/sec
113,580 page-faults # 0.344 K/sec
1,059,061,436,907 cycles # 3.207 GHz
697,782,051,928 instructions # 0.66 insn per cycle
127,949,774,596 branches # 387.484 M/sec
1,371,967,917 branch-misses # 1.07% of all branches
61.952217532 seconds time elapsed
After:
310788.334202 task-clock (msec) # 5.315 CPUs utilized
12,486,276 context-switches # 0.040 M/sec
3,690,660 cpu-migrations # 0.012 M/sec
113,988 page-faults # 0.367 K/sec
1,027,033,932,375 cycles # 3.305 GHz
663,508,168,985 instructions # 0.65 insn per cycle
125,864,966,052 branches # 404.986 M/sec
1,442,683,495 branch-misses # 1.15% of all branches
58.479033228 seconds time elapsed
Performance counter stats for
'build/latest/bin/full_stack-insert-scan-test.263
--gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000
--concurrent_inserts=10 --rows_per_batch=1 --skip_scans':
317505.548908 task-clock (msec) # 5.323 CPUs utilized
12,461,003 context-switches # 0.039 M/sec
3,688,491 cpu-migrations # 0.012 M/sec
113,952 page-faults # 0.359 K/sec
1,029,595,155,391 cycles # 3.243 GHz
663,514,269,656 instructions # 0.64 insn per cycle
125,865,138,975 branches # 396.419 M/sec
1,419,717,877 branch-misses # 1.13% of all branches
59.642651558 seconds time elapsed
Seems like a small improvement in cycle count, wall clock, etc.
Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Reviewed-on: http://gerrit.cloudera.org:8080/9108
Tested-by: Kudu Jenkins
Reviewed-by: Dan Burkert <[email protected]>
---
M CMakeLists.txt
M src/kudu/server/default_path_handlers.cc
M src/kudu/server/pprof_path_handlers.cc
M src/kudu/server/tcmalloc_metrics.cc
M src/kudu/util/mem_tracker.cc
M src/kudu/util/process_memory.cc
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
D
thirdparty/patches/gperftools-Change-default-TCMALLOC_TRANSFER_NUM_OBJ-to-40.patch
A
thirdparty/patches/gperftools-Replace-namespace-base-with-namespace-tcmalloc.patch
D
thirdparty/patches/gperftools-hook-mi_force_unlock-on-OSX-instead-of-pthread_atfork.patch
D
thirdparty/patches/gperftools-issue-827-add_get_default_zone_to_osx_libc_override.patch
M thirdparty/vars.sh
13 files changed, 1,704 insertions(+), 165 deletions(-)
Approvals:
Kudu Jenkins: Verified
Dan Burkert: Looks good to me, approved
--
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <[email protected]>