LTTng-UST vs SystemTap userspace tracing benchmarks

February 15th, 2011
Authors: Mathieu Desnoyers <[email protected]>
         Julien Desfossez <[email protected]>

-- Introduction

The purpose of this benchmark is to compare the userspace tracing
performance of SystemTap and LTTng-UST. The goal is to show that the two
tools are complementary, since SystemTap does not seem able to trace
applications that produce a high throughput of trace data.

-- Benchmark

10 million events are generated per thread; the number of threads varies.
Each event generates a time-stamp and contains a 4-byte integer value.

Synthetic workload: cache-hot test, with the function writing events called
in a loop.

On an 8-core Intel Xeon (2x 4-core) E5405 at 2.0GHz, 16GB RAM.

Running Linux 2.6.37 (custom build, with utrace patches, debuginfo enabled
and LTTng trace clock available).

UST 0.11, hooking on user-space Tracepoints
* UST tuning : Normal (blocking) mode, 16 buffers, 4k each
* We test UST with the LTTng Trace Clock (w/ TC) and with the standard
  clock infrastructure (w/o TC)

SystemTap 1.2-5 (from Debian package), hooking on DTrace user-space static
markup.
* SystemTap probe (stap testutrace.stp -F) :

  probe process("./.libs/tracepoint_benchmark").mark("single_trace")
  {
      printf("%d : %s\n", gettimeofday_ns(), $arg1);
  }

-- Results

0) Baseline : running the program without any instrumentation

TOTAL CPU TIME
Number of threads    baseline
1                    0:0.33
2                    0:0.33
4                    0:0.33
8                    0:0.33

1) Flight recorder tracing comparison UST vs SystemTap

TOTAL CPU TIME
Number of threads    UST w/ TC    UST w/o TC    SystemTap
1                    0:01.81      0:02.25       0:58.36
2                    0:01.86      0:02.13       1:49.94
4                    0:01.86      0:02.22       2:38.49
8                    0:01.97      0:02.14       9:29.58

TOTAL CPU TIME (ns/event)
Number of threads    UST w/ TC    UST w/o TC    SystemTap
1                    181          225           5836
2                    186          213           10994
4                    186          222           15849
8                    197          204           56958

UST SPEEDUP
Number of threads    UST w/ TC    UST w/o TC
1                    32x          25x
2                    59x          51x
4                    85x          71x
8                    289x         279x

2) Tracing to disk comparison UST vs SystemTap (trace output fits in page
   cache)

TOTAL CPU TIME
Number of threads    UST w/ TC    UST w/o TC    SystemTap
1                    0:01.82      0:02.11       1:01.12 (128622 lost)
2                    0:01.95      0:02.14       1:44.20 (397859 lost)
4                    0:01.97      0:02.31       2:38.13 (360549 lost)
8                    0:02.28      0:02.68       9:29.36 (158538 lost)

TOTAL CPU TIME (ns/event)
Number of threads    UST w/ TC    UST w/o TC    SystemTap
1                    182          211           6112
2                    195          214           10420
4                    197          231           15813
8                    228          268           56936

UST SPEEDUP
Number of threads    UST w/ TC    UST w/o TC
1                    33x          28x
2                    53x          48x
4                    80x          68x
8                    249x         212x

OUTPUT SIZE (MB)
Number of threads    UST    SystemTap    UST Output compression
1                    77     271          3.52
2                    153    554          3.62
4                    306    1097         3.58
8                    612    2214         3.61

-- Conclusions

For flight recorder tracing, UST is 289 times faster than SystemTap on an
8-core system with a LTTng kernel and 279 times with a vanilla+utrace
kernel.

When recording traces to disk, UST is 249 times faster than SystemTap on an
8-core system with a LTTng kernel and 212 times with a vanilla+utrace
kernel.

Only a small part of the UST speedup over SystemTap is due to the more
compressed size of its output (binary for UST vs text for SystemTap).

SystemTap does not scale for multithreaded applications running on
multi-core systems.
UST scales linearly with the number of cores for flight recorder tracing
(with the trace clock, the per-event cost stays between 181 and 197 ns from
1 to 8 threads), and almost linearly when saving tracing output to the page
cache (182 to 228 ns).

This study shows that LTTng-UST and SystemTap are two tools with
complementary purposes. LTTng-UST is more efficient at extracting a high
volume of trace data, which allows a developer or a system engineer to
diagnose an unknown problem, whereas SystemTap is better suited to quickly
instrumenting specific problems.
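
The ns/event and speedup figures in the result tables can be re-derived from
the TOTAL CPU TIME columns. The sketch below is not part of the original
benchmark scripts; the function names are hypothetical. It assumes what the
published numbers appear to imply: each time is divided by the 10 million
events one thread generates, and each speedup is the SystemTap ns/event
divided by the UST ns/event, truncated to an integer. One table entry (UST
w/o TC, 8 threads, flight recorder) is listed as 204 ns although 0:02.14
would give 214 ns under this reading.

```python
# Sketch: re-derive ns/event and speedup from the "M:SS.cc" times in the tables.
# Assumption (inferred from the published figures, not stated by the authors):
# the divisor is the per-thread event count, and speedups are truncated.

EVENTS_PER_THREAD = 10_000_000  # "10 million events generated per thread"


def to_centiseconds(cpu_time: str) -> int:
    """Convert a time such as '1:49.94' to whole centiseconds."""
    minutes, seconds = cpu_time.split(":")
    whole, _, frac = seconds.partition(".")
    frac = (frac + "00")[:2]  # normalise to exactly two decimal digits
    return (int(minutes) * 60 + int(whole)) * 100 + int(frac)


def ns_per_event(cpu_time: str, events: int = EVENTS_PER_THREAD) -> int:
    """Time per event in nanoseconds (1 centisecond = 10,000,000 ns)."""
    return to_centiseconds(cpu_time) * 10_000_000 // events


def speedup(systemtap_ns: int, ust_ns: int) -> int:
    """Integer speedup of UST over SystemTap, truncated."""
    return systemtap_ns // ust_ns


if __name__ == "__main__":
    # (threads, UST w/ TC, SystemTap) from the flight recorder table
    flight_recorder = [
        (1, "0:01.81", "0:58.36"),
        (2, "0:01.86", "1:49.94"),
        (4, "0:01.86", "2:38.49"),
        (8, "0:01.97", "9:29.58"),
    ]
    for threads, ust, stap in flight_recorder:
        ust_ns, stap_ns = ns_per_event(ust), ns_per_event(stap)
        print(f"{threads} threads: UST {ust_ns} ns, SystemTap {stap_ns} ns, "
              f"speedup {speedup(stap_ns, ust_ns)}x")
```

Integer centiseconds are used instead of floats so that the derived values
match the tables digit for digit without rounding surprises.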
