----- On Sep 6, 2017, at 3:35 AM, Milian Wolff [email protected] wrote:

> On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
>> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff [email protected] wrote:
>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
>> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff [email protected]
> wrote:
>> >> > Hey all,
>> >> > 
>> >> > I have noticed a very large overhead when linking against liblttng-ust:
>> >> > 
>> >> > ~~~~~
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ cat lttng-test.c
>> >> > int main()
>> >> > {
>> >> > 
>> >> >  return 0;
>> >> > 
>> >> > }
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ gcc -O2 -g -ldl lttng-test.c
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ perf stat -r 5 ./a.out
>> >> > 
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> >          0.209587      task-clock (msec)         #    0.596 CPUs
>> >> >          utilized
>> >> > 
>> >> > ( +-  8.76% )
>> >> > 
>> >> >                 0      context-switches          #    0.000 K/sec
>> >> >                 0      cpu-migrations            #    0.000 K/sec
>> >> >                
>> >> >                49      page-faults               #    0.235 M/sec
>> >> > 
>> >> > ( +-  1.19% )
>> >> > 
>> >> >           706,854      cycles                    #    3.373 GHz
>> >> > 
>> >> > ( +-  8.82% )
>> >> > 
>> >> >           773,603      instructions              #    1.09  insn per
>> >> >           cycle
>> >> > 
>> >> > ( +-  0.75% )
>> >> > 
>> >> >           147,128      branches                  #  701.987 M/sec
>> >> > 
>> >> > ( +-  0.70% )
>> >> > 
>> >> >             4,096      branch-misses             #    2.78% of all
>> >> >             branches
>> >> > 
>> >> > ( +-  5.27% )
>> >> > 
>> >> >       0.000351422 seconds time elapsed
>> >> > 
>> >> > ( +- 11.85% )
>> >> > 
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ perf stat -r 5 ./a.out
>> >> > 
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> >          2.063040      task-clock (msec)         #    0.009 CPUs
>> >> >          utilized
>> >> > 
>> >> > ( +-  1.37% )
>> >> > 
>> >> >                44      context-switches          #    0.021 M/sec
>> >> > 
>> >> > ( +-  1.95% )
>> >> > 
>> >> >                 2      cpu-migrations            #    0.776 K/sec
>> >> > 
>> >> > ( +- 25.00% )
>> >> > 
>> >> >               209      page-faults               #    0.101 M/sec
>> >> > 
>> >> > ( +-  0.34% )
>> >> > 
>> >> >         7,053,686      cycles                    #    3.419 GHz
>> >> > 
>> >> > ( +-  2.03% )
>> >> > 
>> >> >         6,893,783      instructions              #    0.98  insn per
>> >> >         cycle
>> >> > 
>> >> > ( +-  0.25% )
>> >> > 
>> >> >         1,342,492      branches                  #  650.735 M/sec
>> >> > 
>> >> > ( +-  0.20% )
>> >> > 
>> >> >            29,390      branch-misses             #    2.19% of all
>> >> >            branches
>> >> > 
>> >> > ( +-  0.61% )
>> >> > 
>> >> >       0.225597302 seconds time elapsed
>> >> > 
>> >> > ( +-  6.68% )
>> >> > ~~~~~
>> >> > 
>> >> > This is without any LTTng session configured. If I enable LTTng kernel
>> >> > and
>> >> > userspace events, this becomes even worse:
>> >> > 
>> >> > ~~~~~
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ cat $(which run_lttng_trace.sh)
>> >> > #!/bin/sh
>> >> > 
>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
>> >> > 
>> >> >    sudo lttng-sessiond --daemonize
>> >> > 
>> >> > fi
>> >> > 
>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
>> >> > sudo lttng enable-event -c kernel -k -a
>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
>> >> > sudo lttng start
>> >> > $@
>> >> > sudo lttng stop
>> >> > 
>> >> > sudo chmod a+rx -R ~/lttng-traces
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
>> >> > Session auto-20170905-162818 created.
>> >> > Traces will be written in
>> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
>> >> > kernel
>> >> > enabled for session auto-20170905-162818
>> >> > All Kernel events are enabled in channel kernel
>> >> > UST channel ust enabled for session auto-20170905-162818
>> >> > UST event lttng_ust_tracef:* created in channel ust
>> >> > Tracing started for session auto-20170905-162818
>> >> > 
>> >> > Performance counter stats for './a.out' (5 runs):
>> >> >          4.937820      task-clock (msec)         #    0.005 CPUs
>> >> >          utilized
>> >> > 
>> >> > ( +-  1.28% )
>> >> > 
>> >> >               117      context-switches          #    0.024 M/sec
>> >> > 
>> >> > ( +-  0.70% )
>> >> > 
>> >> >                 3      cpu-migrations            #    0.608 K/sec
>> >> > 
>> >> > ( +- 23.57% )
>> >> > 
>> >> >               245      page-faults               #    0.050 M/sec
>> >> > 
>> >> > ( +-  0.20% )
>> >> > 
>> >> >        16,541,355      cycles                    #    3.350 GHz
>> >> > 
>> >> > ( +-  0.94% )
>> >> > 
>> >> >        20,611,637      instructions              #    1.25  insn per
>> >> >        cycle
>> >> > 
>> >> > ( +-  0.23% )
>> >> > 
>> >> >         3,580,525      branches                  #  725.123 M/sec
>> >> > 
>> >> > ( +-  0.19% )
>> >> > 
>> >> >            52,093      branch-misses             #    1.45% of all
>> >> >            branches
>> >> > 
>> >> > ( +-  1.73% )
>> >> > 
>> >> >       0.965545882 seconds time elapsed
>> >> > 
>> >> > ( +- 16.96% )
>> >> > 
>> >> > Waiting for data availability...
>> >> > Tracing stopped for session auto-20170905-162818
>> >> > ~~~~~
>> >> > 
>> >> > I see this ~1s delay during startup in nearly all test applications
>> >> > where
>> >> > I
>> >> > wanted to use lttng-ust. In tracecompass, and according to perf sleep
>> >> > time
>> >> > profiling, this seems to be mostly due to extensive sleeping on mutex
>> >> > locks
>> >> > (from ust_lock in liblttng-ust). With perf trace, I also see membarrier
>> >> > taking a lot of time:
>> >> > 
>> >> > ~~~~~
>> >> > # without a running lttng session:
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ perf trace --duration 1 ./a.out
>> >> > 
>> >> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
> msg:
>> >> > 0x7fbe2fbb1070                  ) = 1
>> >> > 
>> >> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
>> >> > 
>> >> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
>> >> >    WAIT|
>> >> > 
>> >> > PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> > # with a running lttng session:
>> >> > ┌milian@milian-kdab2:/tmp
>> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
>> >> > 
>> >> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
>> >> >     3<socket:[1185362]>,
>> > 
>> > msg:
>> >> > 0x7fcd81d8a0b0                  ) = 612
>> >> > 
>> >> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd: 3<socket:[1185362]>,
>> > 
>> > msg:
>> >> > 0x7fcd81d8a070                  ) = 1
>> >> > 
>> >> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
>> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
>> >> > 
>> >> >   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
>> >> >   WAIT|
>> >> > 
>> >> > PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
> op:
>> >> > WAIT|PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0, op:
>> >> >   WAIT|
>> >> > 
>> >> > PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> op:
>> >> > WAIT|PRIV, val: 2                    ) = 0
>> >> > 
>> >> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
>> >> > 
>> >> > ) = 0
>> >> > 
>> >> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
>> >> > 
>> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0, 128)=22
>> >> > ~~~~~
>> >> > 
>> >> > This makes lttng-ust pretty much unusable for any short-lived
>> >> > application.
>> >> > Is this a known limitation?
>> >> 
>> >> This kind of delay is not completely unexpected, but a bit higher than I
>> >> would expect. I'm interested in figuring out if most of this delay
>> >> happens
>> >> to be caused by sys_membarrier, either directly, or indirectly (through a
>> >> lock).
>> > 
>> > See my other email which adds some harder numbers from a perf sleep
>> > record.
>> > 
>> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
>> >> > I've
>> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
>> >> > above measurements there. So far, I suspected the ust statedump to take
>> >> > too much time, but I disabled that one for the measurements above.
>> >> > 
>> >> > Any help would be appreciated
>> >> 
>> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
>> >> share the performance results ?
>> > 
>> > I'd have to recompile the kernel, which is something I don't like to do.
>> > Why is this required, or put differently - why would the kernel option
>> > influence the userspace LTTng library?
>> 
>> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
>> system call to speed up tracing, at the expense of slower execution of
>> tracing state update.
>> 
>> I'm currently proposing a new sys_membarrier command for PRIVATE_EXPEDITED
>> (single-process) use cases which should execute much faster than the
>> non-expedited SHARED command currently implemented.
>> 
>> liburcu 0.9 and 0.10 detects that this system call is available, and uses
>> it when it can.
> 
> May I ask whether you have such a configuration available? The test case I
> have submitted should be easy to reproduce for everyone. Compiling my own
> kernel version takes a huge amount of time which I'm quite frankly not willing
> to invest at this point.

We have reproduced the situation, and it's pointing towards our use of the
sys_membarrier system call. We'll try narrowing it down between its use by
lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).

You can try to patch liburcu so it won't try using sys_membarrier at all
(see attached patch)

> 
> I simply wonder whether anyone else is actually using lttng-ust, or if they
> are, whether they are only doing that for long-lived applications where this
> overhead is not noticeable?

It appears that this application startup slowdown mainly appears with
LTTng linked against liburcu 0.9 and 0.10 running with rather recent kernels.
This would explain why you are the first one to report this across our user
base.

Thanks,

Mathieu


> 
> Thanks
> 
> --
> Milian Wolff | [email protected] | Senior Software Engineer
> KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt Experts

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
diff --git a/src/urcu-bp.c b/src/urcu-bp.c
index c0fac83..2d8bcca 100644
--- a/src/urcu-bp.c
+++ b/src/urcu-bp.c
@@ -95,7 +95,7 @@ static
 int rcu_bp_refcount;
 
 /* If the headers do not support membarrier system call, fall back smp_mb. */
-#ifdef __NR_membarrier
+#if 0 //def __NR_membarrier
 # define membarrier(...)		syscall(__NR_membarrier, __VA_ARGS__)
 #else
 # define membarrier(...)		-ENOSYS
diff --git a/src/urcu.c b/src/urcu.c
index a81b129..1aafbc0 100644
--- a/src/urcu.c
+++ b/src/urcu.c
@@ -64,7 +64,7 @@
 #define RCU_QS_ACTIVE_ATTEMPTS 100
 
 /* If the headers do not support membarrier system call, fall back on RCU_MB */
-#ifdef __NR_membarrier
+#if 0 //def __NR_membarrier
 # define membarrier(...)		syscall(__NR_membarrier, __VA_ARGS__)
 #else
 # define membarrier(...)		-ENOSYS
_______________________________________________
lttng-dev mailing list
[email protected]
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

Reply via email to