On Wednesday, September 6, 2017 9:57:11 PM CEST Mathieu Desnoyers wrote: > ----- On Sep 6, 2017, at 3:35 AM, Milian Wolff [email protected] wrote: > > On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote: > >> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff [email protected] wrote: > >> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote: > >> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff [email protected] > > > > wrote: > >> >> > Hey all, > >> >> > > >> >> > I have noticed a very large overhead when linking against > >> >> > liblttng-ust: > >> >> > > >> >> > ~~~~~ > >> >> > ┌milian@milian-kdab2:/tmp > >> >> > └$ cat lttng-test.c > >> >> > int main() > >> >> > { > >> >> > > >> >> > return 0; > >> >> > > >> >> > } > >> >> > ┌milian@milian-kdab2:/tmp > >> >> > └$ gcc -O2 -g -ldl lttng-test.c > >> >> > ┌milian@milian-kdab2:/tmp > >> >> > └$ perf stat -r 5 ./a.out > >> >> > > >> >> > Performance counter stats for './a.out' (5 runs): > >> >> > 0.209587 task-clock (msec) # 0.596 CPUs > >> >> > utilized > >> >> > > >> >> > ( +- 8.76% ) > >> >> > > >> >> > 0 context-switches # 0.000 K/sec > >> >> > 0 cpu-migrations # 0.000 K/sec > >> >> > > >> >> > 49 page-faults # 0.235 M/sec > >> >> > > >> >> > ( +- 1.19% ) > >> >> > > >> >> > 706,854 cycles # 3.373 GHz > >> >> > > >> >> > ( +- 8.82% ) > >> >> > > >> >> > 773,603 instructions # 1.09 insn per > >> >> > cycle > >> >> > > >> >> > ( +- 0.75% ) > >> >> > > >> >> > 147,128 branches # 701.987 M/sec > >> >> > > >> >> > ( +- 0.70% ) > >> >> > > >> >> > 4,096 branch-misses # 2.78% of all > >> >> > branches > >> >> > > >> >> > ( +- 5.27% ) > >> >> > > >> >> > 0.000351422 seconds time elapsed > >> >> > > >> >> > ( +- 11.85% ) > >> >> > > >> >> > ┌milian@milian-kdab2:/tmp > >> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c > >> >> > ┌milian@milian-kdab2:/tmp > >> >> > └$ perf stat -r 5 ./a.out > >> >> > > >> >> > Performance counter stats for './a.out' (5 runs): > >> >> > 2.063040 task-clock (msec) # 0.009 CPUs > >> >> > utilized > >> >> > > >> >> > ( +- 1.37% ) > >> >> > > >> >> > 44 context-switches # 0.021 M/sec > >> >> > > >> >> > ( +- 1.95% ) > >> >> > > >> >> > 2 cpu-migrations # 0.776 K/sec > >> >> > > >> >> > ( +- 25.00% ) > >> >> > > >> >> > 209 page-faults # 0.101 M/sec > >> >> > > >> >> > ( +- 0.34% ) > >> >> > > >> >> > 7,053,686 cycles # 3.419 GHz > >> >> > > >> >> > ( +- 2.03% ) > >> >> > > >> >> > 6,893,783 instructions # 0.98 insn per > >> >> > cycle > >> >> > > >> >> > ( +- 0.25% ) > >> >> > > >> >> > 1,342,492 branches # 650.735 M/sec > >> >> > > >> >> > ( +- 0.20% ) > >> >> > > >> >> > 29,390 branch-misses # 2.19% of all > >> >> > branches > >> >> > > >> >> > ( +- 0.61% ) > >> >> > > >> >> > 0.225597302 seconds time elapsed > >> >> > > >> >> > ( +- 6.68% ) > >> >> > ~~~~~ > >> >> > > >> >> > This is without any LTTng session configured. If I enable LTTng > >> >> > kernel > >> >> > and > >> >> > userspace events, this becomes even worse: > >> >> > > >> >> > ~~~~~ > >> >> > ┌milian@milian-kdab2:/tmp > >> >> > └$ cat $(which run_lttng_trace.sh) > >> >> > #!/bin/sh > >> >> > > >> >> > if [ -z "$(pidof lttng-sessiond)" ]; then > >> >> > > >> >> > sudo lttng-sessiond --daemonize > >> >> > > >> >> > fi > >> >> > > >> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds) > >> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8 > >> >> > sudo lttng enable-event -c kernel -k -a > >> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8 > >> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:* > >> >> > sudo lttng start > >> >> > $@ > >> >> > sudo lttng stop > >> >> > > >> >> > sudo chmod a+rx -R ~/lttng-traces > >> >> > ┌milian@milian-kdab2:/tmp > >> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out > >> >> > Session auto-20170905-162818 created. > >> >> > Traces will be written in > >> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel > >> >> > kernel > >> >> > enabled for session auto-20170905-162818 > >> >> > All Kernel events are enabled in channel kernel > >> >> > UST channel ust enabled for session auto-20170905-162818 > >> >> > UST event lttng_ust_tracef:* created in channel ust > >> >> > Tracing started for session auto-20170905-162818 > >> >> > > >> >> > Performance counter stats for './a.out' (5 runs): > >> >> > 4.937820 task-clock (msec) # 0.005 CPUs > >> >> > utilized > >> >> > > >> >> > ( +- 1.28% ) > >> >> > > >> >> > 117 context-switches # 0.024 M/sec > >> >> > > >> >> > ( +- 0.70% ) > >> >> > > >> >> > 3 cpu-migrations # 0.608 K/sec > >> >> > > >> >> > ( +- 23.57% ) > >> >> > > >> >> > 245 page-faults # 0.050 M/sec > >> >> > > >> >> > ( +- 0.20% ) > >> >> > > >> >> > 16,541,355 cycles # 3.350 GHz > >> >> > > >> >> > ( +- 0.94% ) > >> >> > > >> >> > 20,611,637 instructions # 1.25 insn per > >> >> > cycle > >> >> > > >> >> > ( +- 0.23% ) > >> >> > > >> >> > 3,580,525 branches # 725.123 M/sec > >> >> > > >> >> > ( +- 0.19% ) > >> >> > > >> >> > 52,093 branch-misses # 1.45% of all > >> >> > branches > >> >> > > >> >> > ( +- 1.73% ) > >> >> > > >> >> > 0.965545882 seconds time elapsed > >> >> > > >> >> > ( +- 16.96% ) > >> >> > > >> >> > Waiting for data availability... > >> >> > Tracing stopped for session auto-20170905-162818 > >> >> > ~~~~~ > >> >> > > >> >> > I see this ~1s delay during startup in nearly all test applications > >> >> > where > >> >> > I > >> >> > wanted to use lttng-ust. In tracecompass, and according to perf > >> >> > sleep > >> >> > time > >> >> > profiling, this seems to be mostly due to extensive sleeping on > >> >> > mutex > >> >> > locks > >> >> > (from ust_lock in liblttng-ust). With perf trace, I also see > >> >> > membarrier > >> >> > taking a lot of time: > >> >> > > >> >> > ~~~~~ > >> >> > # without a running lttng session: > >> >> > ┌milian@milian-kdab2:/tmp > >> >> > └$ perf trace --duration 1 ./a.out > >> >> > > >> >> > 6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>, > > > > msg: > >> >> > 0x7fbe2fbb1070 ) = 1 > >> >> > > >> >> > 5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op: > >> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0 > >> >> > > >> >> > 59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op: > >> >> > WAIT| > >> >> > > >> >> > PRIV, val: 2 ) = 0 > >> >> > > >> >> > 179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > # with a running lttng session: > >> >> > ┌milian@milian-kdab2:/tmp > >> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out > >> >> > > >> >> > 6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd: > >> >> > 3<socket:[1185362]>, > >> > > >> > msg: > >> >> > 0x7fcd81d8a0b0 ) = 612 > >> >> > > >> >> > 400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: > >> >> > 3<socket:[1185362]>, > >> > > >> > msg: > >> >> > 0x7fcd81d8a070 ) = 1 > >> >> > > >> >> > 6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op: > >> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0 > >> >> > > >> >> > 610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op: > >> >> > WAIT| > >> >> > > >> >> > PRIV, val: 2 ) = 0 > >> >> > > >> >> > 616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960, > > > > op: > >> >> > WAIT|PRIV, val: 2 ) = 0 > >> >> > > >> >> > 738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, > >> >> > op: > >> >> > WAIT| > >> >> > > >> >> > PRIV, val: 2 ) = 0 > >> >> > > >> >> > 838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, > > > > op: > >> >> > WAIT|PRIV, val: 2 ) = 0 > >> >> > > >> >> > 958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1 > >> >> > > >> >> > ) = 0 > >> >> > > >> >> > 6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000 > >> >> > > >> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, > >> >> > 128)=22 > >> >> > ~~~~~ > >> >> > > >> >> > This makes lttng-ust pretty much unusable for any short-lived > >> >> > application. > >> >> > Is this a known limitation? > >> >> > >> >> This kind of delay is not completely unexpected, but a bit higher than > >> >> I > >> >> would expect. I'm interested in figuring out if most of this delay > >> >> happens > >> >> to be caused by sys_membarrier, either directly, or indirectly > >> >> (through a > >> >> lock). > >> > > >> > See my other email which adds some harder numbers from a perf sleep > >> > record. > >> > > >> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4. > >> >> > I've > >> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the > >> >> > above measurements there. So far, I suspected the ust statedump to > >> >> > take > >> >> > too much time, but I disabled that one for the measurements above. > >> >> > > >> >> > Any help would be appreciated > >> >> > >> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and > >> >> share the performance results ? > >> > > >> > I'd have to recompile the kernel, which is something I don't like to > >> > do. > >> > Why is this required, or put differently - why would the kernel option > >> > influence the userspace LTTng library? > >> > >> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier > >> system call to speed up tracing, at the expense of slower execution of > >> tracing state update. > >> > >> I'm currently proposing a new sys_membarrier command for > >> PRIVATE_EXPEDITED > >> (single-process) use cases which should execute much faster than the > >> non-expedited SHARED command currently implemented. > >> > >> liburcu 0.9 and 0.10 detects that this system call is available, and uses > >> it when it can. > > > > May I ask whether you have such a configuration available? The test case I > > have submitted should be easy to reproduce for everyone. Compiling my own > > kernel version takes a huge amount of time which I'm quite frankly not > > willing to invest at this point. > > We have reproduced the situation, and it's pointing towards our use of the > sys_membarrier system call. We'll try narrowing it down between its use by > lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so). > > You can try to patch liburcu so it won't try using sys_membarrier at all > (see attached patch)
It does improve the situation tremendously. I just compiled v0.10.0-6-g6447802 with and without the patch to record the numbers below. Note that apparently your latest patch [1] in git master alone is not enough to fix the performance bug, I do need to apply the patch you send by mail, too. [1]: https://github.com/urcu/userspace-rcu/commit/ 64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6 But once I also apply the patch you send by mail, things start to fly. In the below, "before" means running against userspace-rcu v0.10.0-6-g6447802, "after" means running against the same version with your emailed patch applied on top. Measurement without an active LTTng session: ~~~~~ # before: ┌milian@milian-kdab2:/tmp └$ perf stat -r 5 ./a.out Performance counter stats for './a.out' (5 runs): 3.973288 task-clock (msec) # 0.005 CPUs utilized ( +- 0.39% ) 153 context-switches # 0.039 M/sec ( +- 1.43% ) 2 cpu-migrations # 0.554 K/sec ( +- 26.50% ) 315 page-faults # 0.079 M/sec ( +- 0.21% ) 13,257,360 cycles # 3.337 GHz ( +- 0.34% ) 14,906,461 instructions # 1.12 insn per cycle ( +- 0.11% ) 2,910,368 branches # 732.483 M/sec ( +- 0.10% ) 42,766 branch-misses # 1.47% of all branches ( +- 0.89% ) 0.851654352 seconds time elapsed ( +- 4.33% ) # after: ┌milian@milian-kdab2:/tmp └$ perf stat -r 5 ./a.out Performance counter stats for './a.out' (5 runs): 3.608242 task-clock (msec) # 0.056 CPUs utilized ( +- 1.84% ) 138 context-switches # 0.038 M/sec ( +- 2.70% ) 1 cpu-migrations # 0.277 K/sec 313 page-faults # 0.087 M/sec ( +- 0.27% ) 12,022,499 cycles # 3.332 GHz ( +- 1.73% ) 14,861,318 instructions # 1.24 insn per cycle ( +- 0.32% ) 2,902,920 branches # 804.525 M/sec ( +- 0.32% ) 38,264 branch-misses # 1.32% of all branches ( +- 0.78% ) 0.063929538 seconds time elapsed ( +- 10.93% ) ~~~~~ With an active LTTng session: ~~~~~ # before: Performance counter stats for './a.out' (5 runs): 9.152925 task-clock (msec) # 0.005 CPUs utilized ( +- 1.13% ) 216 context-switches # 0.024 M/sec ( +- 2.83% ) 4 cpu-migrations # 0.459 K/sec ( +- 11.66% ) 370 page-faults # 0.040 M/sec ( +- 0.20% ) 31,311,283 cycles # 3.421 GHz ( +- 0.69% ) 40,509,922 instructions # 1.29 insn per cycle ( +- 0.40% ) 7,192,651 branches # 785.831 M/sec ( +- 0.33% ) 90,264 branch-misses # 1.25% of all branches ( +- 1.46% ) 1.839545247 seconds time elapsed ( +- 6.95% ) # after: Performance counter stats for './a.out' (5 runs): 7.052094 task-clock (msec) # 0.026 CPUs utilized ( +- 1.45% ) 190 context-switches # 0.027 M/sec ( +- 4.88% ) 1 cpu-migrations # 0.142 K/sec ( +- 44.72% ) 335 page-faults # 0.048 M/sec ( +- 0.37% ) 24,105,408 cycles # 3.418 GHz ( +- 1.39% ) 35,100,950 instructions # 1.46 insn per cycle ( +- 0.17% ) 6,197,972 branches # 878.884 M/sec ( +- 0.15% ) 56,045 branch-misses # 0.90% of all branches ( +- 1.41% ) 0.272121270 seconds time elapsed ( +- 52.41% ) ~~~~~ > > I simply wonder whether anyone else is actually using lttng-ust, or if > > they > > are, whether they are only doing that for long-lived applications where > > this overhead is not noticeable? > > It appears that this application startup slowdown mainly appears with > LTTng linked against liburcu 0.9 and 0.10 running with rather recent > kernels. This would explain why you are the first one to report this across > our user base. Thank you, that explains it. -- Milian Wolff | [email protected] | Senior Software Engineer KDAB (Deutschland) GmbH&Co KG, a KDAB Group company Tel: +49-30-521325470 KDAB - The Qt Experts
smime.p7s
Description: S/MIME cryptographic signature
_______________________________________________ lttng-dev mailing list [email protected] https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
