Re: [lttng-dev] large liblttng-ust startup overhead (ust_lock)

Milian Wolff Thu, 07 Sep 2017 00:51:08 -0700

On Wednesday, September 6, 2017 9:57:11 PM CEST Mathieu Desnoyers wrote:
> ----- On Sep 6, 2017, at 3:35 AM, Milian Wolff [email protected] wrote:
> > On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
> >> ----- On Sep 5, 2017, at 11:08 AM, Milian Wolff [email protected] 
wrote:
> >> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> >> >> ----- On Sep 5, 2017, at 10:34 AM, Milian Wolff [email protected]
> > 
> > wrote:
> >> >> > Hey all,
> >> >> > 
> >> >> > I have noticed a very large overhead when linking against
> >> >> > liblttng-ust:
> >> >> > 
> >> >> > ~~~~~
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ cat lttng-test.c
> >> >> > int main()
> >> >> > {
> >> >> > 
> >> >> >  return 0;
> >> >> > 
> >> >> > }
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ gcc -O2 -g -ldl lttng-test.c
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ perf stat -r 5 ./a.out
> >> >> > 
> >> >> > Performance counter stats for './a.out' (5 runs):
> >> >> >          0.209587      task-clock (msec)         #    0.596 CPUs
> >> >> >          utilized
> >> >> > 
> >> >> > ( +-  8.76% )
> >> >> > 
> >> >> >                 0      context-switches          #    0.000 K/sec
> >> >> >                 0      cpu-migrations            #    0.000 K/sec
> >> >> >                
> >> >> >                49      page-faults               #    0.235 M/sec
> >> >> > 
> >> >> > ( +-  1.19% )
> >> >> > 
> >> >> >           706,854      cycles                    #    3.373 GHz
> >> >> > 
> >> >> > ( +-  8.82% )
> >> >> > 
> >> >> >           773,603      instructions              #    1.09  insn per
> >> >> >           cycle
> >> >> > 
> >> >> > ( +-  0.75% )
> >> >> > 
> >> >> >           147,128      branches                  #  701.987 M/sec
> >> >> > 
> >> >> > ( +-  0.70% )
> >> >> > 
> >> >> >             4,096      branch-misses             #    2.78% of all
> >> >> >             branches
> >> >> > 
> >> >> > ( +-  5.27% )
> >> >> > 
> >> >> >       0.000351422 seconds time elapsed
> >> >> > 
> >> >> > ( +- 11.85% )
> >> >> > 
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ perf stat -r 5 ./a.out
> >> >> > 
> >> >> > Performance counter stats for './a.out' (5 runs):
> >> >> >          2.063040      task-clock (msec)         #    0.009 CPUs
> >> >> >          utilized
> >> >> > 
> >> >> > ( +-  1.37% )
> >> >> > 
> >> >> >                44      context-switches          #    0.021 M/sec
> >> >> > 
> >> >> > ( +-  1.95% )
> >> >> > 
> >> >> >                 2      cpu-migrations            #    0.776 K/sec
> >> >> > 
> >> >> > ( +- 25.00% )
> >> >> > 
> >> >> >               209      page-faults               #    0.101 M/sec
> >> >> > 
> >> >> > ( +-  0.34% )
> >> >> > 
> >> >> >         7,053,686      cycles                    #    3.419 GHz
> >> >> > 
> >> >> > ( +-  2.03% )
> >> >> > 
> >> >> >         6,893,783      instructions              #    0.98  insn per
> >> >> >         cycle
> >> >> > 
> >> >> > ( +-  0.25% )
> >> >> > 
> >> >> >         1,342,492      branches                  #  650.735 M/sec
> >> >> > 
> >> >> > ( +-  0.20% )
> >> >> > 
> >> >> >            29,390      branch-misses             #    2.19% of all
> >> >> >            branches
> >> >> > 
> >> >> > ( +-  0.61% )
> >> >> > 
> >> >> >       0.225597302 seconds time elapsed
> >> >> > 
> >> >> > ( +-  6.68% )
> >> >> > ~~~~~
> >> >> > 
> >> >> > This is without any LTTng session configured. If I enable LTTng
> >> >> > kernel
> >> >> > and
> >> >> > userspace events, this becomes even worse:
> >> >> > 
> >> >> > ~~~~~
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ cat $(which run_lttng_trace.sh)
> >> >> > #!/bin/sh
> >> >> > 
> >> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >> >> > 
> >> >> >    sudo lttng-sessiond --daemonize
> >> >> > 
> >> >> > fi
> >> >> > 
> >> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >> >> > sudo lttng enable-event -c kernel -k -a
> >> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >> >> > sudo lttng start
> >> >> > $@
> >> >> > sudo lttng stop
> >> >> > 
> >> >> > sudo chmod a+rx -R ~/lttng-traces
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> >> >> > Session auto-20170905-162818 created.
> >> >> > Traces will be written in
> >> >> > /home/milian/lttng-traces/2017-09-05T16:28:18+02:00 Kernel channel
> >> >> > kernel
> >> >> > enabled for session auto-20170905-162818
> >> >> > All Kernel events are enabled in channel kernel
> >> >> > UST channel ust enabled for session auto-20170905-162818
> >> >> > UST event lttng_ust_tracef:* created in channel ust
> >> >> > Tracing started for session auto-20170905-162818
> >> >> > 
> >> >> > Performance counter stats for './a.out' (5 runs):
> >> >> >          4.937820      task-clock (msec)         #    0.005 CPUs
> >> >> >          utilized
> >> >> > 
> >> >> > ( +-  1.28% )
> >> >> > 
> >> >> >               117      context-switches          #    0.024 M/sec
> >> >> > 
> >> >> > ( +-  0.70% )
> >> >> > 
> >> >> >                 3      cpu-migrations            #    0.608 K/sec
> >> >> > 
> >> >> > ( +- 23.57% )
> >> >> > 
> >> >> >               245      page-faults               #    0.050 M/sec
> >> >> > 
> >> >> > ( +-  0.20% )
> >> >> > 
> >> >> >        16,541,355      cycles                    #    3.350 GHz
> >> >> > 
> >> >> > ( +-  0.94% )
> >> >> > 
> >> >> >        20,611,637      instructions              #    1.25  insn per
> >> >> >        cycle
> >> >> > 
> >> >> > ( +-  0.23% )
> >> >> > 
> >> >> >         3,580,525      branches                  #  725.123 M/sec
> >> >> > 
> >> >> > ( +-  0.19% )
> >> >> > 
> >> >> >            52,093      branch-misses             #    1.45% of all
> >> >> >            branches
> >> >> > 
> >> >> > ( +-  1.73% )
> >> >> > 
> >> >> >       0.965545882 seconds time elapsed
> >> >> > 
> >> >> > ( +- 16.96% )
> >> >> > 
> >> >> > Waiting for data availability...
> >> >> > Tracing stopped for session auto-20170905-162818
> >> >> > ~~~~~
> >> >> > 
> >> >> > I see this ~1s delay during startup in nearly all test applications
> >> >> > where
> >> >> > I
> >> >> > wanted to use lttng-ust. In tracecompass, and according to perf
> >> >> > sleep
> >> >> > time
> >> >> > profiling, this seems to be mostly due to extensive sleeping on
> >> >> > mutex
> >> >> > locks
> >> >> > (from ust_lock in liblttng-ust). With perf trace, I also see
> >> >> > membarrier
> >> >> > taking a lot of time:
> >> >> > 
> >> >> > ~~~~~
> >> >> > # without a running lttng session:
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ perf trace --duration 1 ./a.out
> >> >> > 
> >> >> >     6.492 (52.468 ms): a.out/23672 recvmsg(fd: 3<socket:[1178439]>,
> > 
> > msg:
> >> >> > 0x7fbe2fbb1070                  ) = 1
> >> >> > 
> >> >> >     5.077 (54.271 ms): a.out/23671 futex(uaddr: 0x7fbe30d508a0, op:
> >> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7ffc474ff5a0, val3: 4294967295) = 0
> >> >> > 
> >> >> >    59.598 (79.379 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   138.984 (39.945 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >    59.621 (119.324 ms): a.out/23672 futex(uaddr: 0x7fbe303e54c0, op:
> >> >> >    WAIT|
> >> >> > 
> >> >> > PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   179.045 (79.918 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   258.971 (39.997 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   299.052 (79.883 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   378.942 (59.988 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   439.022 (69.908 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   508.937 (80.027 ms): a.out/23671 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> > # with a running lttng session:
> >> >> > ┌milian@milian-kdab2:/tmp
> >> >> > └$ run_lttng_trace.sh perf trace --duration 1 ./a.out
> >> >> > 
> >> >> >     6.666 (392.496 ms): a.out-ust/23790 recvmsg(fd:
> >> >> >     3<socket:[1185362]>,
> >> > 
> >> > msg:
> >> >> > 0x7fcd81d8a0b0                  ) = 612
> >> >> > 
> >> >> >   400.238 (68.452 ms): a.out-ust/23790 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   468.698 (39.983 ms): a.out-ust/23790 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   510.567 (98.113 ms): a.out-ust/23790 recvmsg(fd:
> >> >> >   3<socket:[1185362]>,
> >> > 
> >> > msg:
> >> >> > 0x7fcd81d8a070                  ) = 1
> >> >> > 
> >> >> >     6.467 (603.500 ms): a.out/23789 futex(uaddr: 0x7fcd82f298a0, op:
> >> >> > WAIT_BITSET|PRIV|CLKRT, utime: 0x7fff4df6c970, val3: 4294967295) = 0
> >> >> > 
> >> >> >   610.020 ( 6.307 ms): a.out/23789 futex(uaddr: 0x7fcd82f29960, op:
> >> >> >   WAIT|
> >> >> > 
> >> >> > PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   616.446 (62.265 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   678.721 (59.916 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   616.332 (122.335 ms): a.out-ust/23790 futex(uaddr: 0x7fcd82f29960,
> > 
> > op:
> >> >> > WAIT|PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   738.976 (49.658 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   788.642 (49.987 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   739.014 (99.644 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> >> >> >   op:
> >> >> >   WAIT|
> >> >> > 
> >> >> > PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   838.651 (49.977 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   888.636 (69.983 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >   838.662 (119.978 ms): a.out-ust/23790 futex(uaddr: 0x7fcd825be4c0,
> > 
> > op:
> >> >> > WAIT|PRIV, val: 2                    ) = 0
> >> >> > 
> >> >> >   958.739 (59.873 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1018.621 (80.058 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1098.782 (49.844 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1148.634 (59.998 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1208.731 (69.940 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1278.680 (79.947 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1358.727 (39.881 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >  1398.616 (39.980 ms): a.out/23789 membarrier(cmd: 1
> >> >> > 
> >> >> > ) = 0
> >> >> > 
> >> >> >     6.640 (1432.157 ms): a.out-ust/23791 futex(uaddr: 0x7fcd82f90000
> >> >> > 
> >> >> > ) = -1 (null) INTERNAL ERROR: strerror_r(512, 0x7ffdbc5467a0,
> >> >> > 128)=22
> >> >> > ~~~~~
> >> >> > 
> >> >> > This makes lttng-ust pretty much unusable for any short-lived
> >> >> > application.
> >> >> > Is this a known limitation?
> >> >> 
> >> >> This kind of delay is not completely unexpected, but a bit higher than
> >> >> I
> >> >> would expect. I'm interested in figuring out if most of this delay
> >> >> happens
> >> >> to be caused by sys_membarrier, either directly, or indirectly
> >> >> (through a
> >> >> lock).
> >> > 
> >> > See my other email which adds some harder numbers from a perf sleep
> >> > record.
> >> > 
> >> >> > I'm using lttng-ust on Arch from the AUR packages, i.e. LTTng 2.9.4.
> >> >> > I've
> >> >> > seen this problem also on Ubuntu 17.04 though I didn't do any of the
> >> >> > above measurements there. So far, I suspected the ust statedump to
> >> >> > take
> >> >> > too much time, but I disabled that one for the measurements above.
> >> >> > 
> >> >> > Any help would be appreciated
> >> >> 
> >> >> Can you reproduce with a Linux kernel that has CONFIG_MEMBARRIER=n and
> >> >> share the performance results ?
> >> > 
> >> > I'd have to recompile the kernel, which is something I don't like to
> >> > do.
> >> > Why is this required, or put differently - why would the kernel option
> >> > influence the userspace LTTng library?
> >> 
> >> liburcu 0.9.x and 0.10.x (a dependency of lttng-ust) uses the membarrier
> >> system call to speed up tracing, at the expense of slower execution of
> >> tracing state update.
> >> 
> >> I'm currently proposing a new sys_membarrier command for
> >> PRIVATE_EXPEDITED
> >> (single-process) use cases which should execute much faster than the
> >> non-expedited SHARED command currently implemented.
> >> 
> >> liburcu 0.9 and 0.10 detects that this system call is available, and uses
> >> it when it can.
> > 
> > May I ask whether you have such a configuration available? The test case I
> > have submitted should be easy to reproduce for everyone. Compiling my own
> > kernel version takes a huge amount of time which I'm quite frankly not
> > willing to invest at this point.
> 
> We have reproduced the situation, and it's pointing towards our use of the
> sys_membarrier system call. We'll try narrowing it down between its use by
> lttng-ust (liburcu-bp.so) or by lttng-sessiond (liburcu.so).
> 
> You can try to patch liburcu so it won't try using sys_membarrier at all
> (see attached patch)


It does improve the situation tremendously. I just compiled v0.10.0-6-g6447802 
with and without the patch to record the numbers below. Note that apparently 
your latest patch [1] in git master alone is not enough to fix the performance 
bug, I do need to apply the patch you send by mail, too.

[1]: https://github.com/urcu/userspace-rcu/commit/
64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6

But once I also apply the patch you send by mail, things start to fly. In the 
below, "before" means running against userspace-rcu v0.10.0-6-g6447802, 
"after" means running against the same version with your emailed patch applied 
on top.

Measurement without an active LTTng session:

~~~~~
# before:
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out

 Performance counter stats for './a.out' (5 runs):

          3.973288      task-clock (msec)         #    0.005 CPUs utilized      
      
( +-  0.39% )
               153      context-switches          #    0.039 M/sec              
      
( +-  1.43% )
                 2      cpu-migrations            #    0.554 K/sec              
      
( +- 26.50% )
               315      page-faults               #    0.079 M/sec              
      
( +-  0.21% )
        13,257,360      cycles                    #    3.337 GHz                
      
( +-  0.34% )
        14,906,461      instructions              #    1.12  insn per cycle     
      
( +-  0.11% )
         2,910,368      branches                  #  732.483 M/sec              
      
( +-  0.10% )
            42,766      branch-misses             #    1.47% of all branches    
      
( +-  0.89% )

       0.851654352 seconds time elapsed                                         
 
( +-  4.33% )

# after:
┌milian@milian-kdab2:/tmp
└$ perf stat -r 5 ./a.out

 Performance counter stats for './a.out' (5 runs):

          3.608242      task-clock (msec)         #    0.056 CPUs utilized      
      
( +-  1.84% )
               138      context-switches          #    0.038 M/sec              
      
( +-  2.70% )
                 1      cpu-migrations            #    0.277 K/sec              
    
               313      page-faults               #    0.087 M/sec              
      
( +-  0.27% )
        12,022,499      cycles                    #    3.332 GHz                
      
( +-  1.73% )
        14,861,318      instructions              #    1.24  insn per cycle     
      
( +-  0.32% )
         2,902,920      branches                  #  804.525 M/sec              
      
( +-  0.32% )
            38,264      branch-misses             #    1.32% of all branches    
      
( +-  0.78% )

       0.063929538 seconds time elapsed                                         
 
( +- 10.93% )
~~~~~

With an active LTTng session:

~~~~~
# before:
 Performance counter stats for './a.out' (5 runs):

          9.152925      task-clock (msec)         #    0.005 CPUs utilized      
      
( +-  1.13% )
               216      context-switches          #    0.024 M/sec              
      
( +-  2.83% )
                 4      cpu-migrations            #    0.459 K/sec              
      
( +- 11.66% )
               370      page-faults               #    0.040 M/sec              
      
( +-  0.20% )
        31,311,283      cycles                    #    3.421 GHz                
      
( +-  0.69% )
        40,509,922      instructions              #    1.29  insn per cycle     
      
( +-  0.40% )
         7,192,651      branches                  #  785.831 M/sec              
      
( +-  0.33% )
            90,264      branch-misses             #    1.25% of all branches    
      
( +-  1.46% )

       1.839545247 seconds time elapsed                                         
 
( +-  6.95% )

# after:
 Performance counter stats for './a.out' (5 runs):

          7.052094      task-clock (msec)         #    0.026 CPUs utilized      
      
( +-  1.45% )
               190      context-switches          #    0.027 M/sec              
      
( +-  4.88% )
                 1      cpu-migrations            #    0.142 K/sec              
      
( +- 44.72% )
               335      page-faults               #    0.048 M/sec              
      
( +-  0.37% )
        24,105,408      cycles                    #    3.418 GHz                
      
( +-  1.39% )
        35,100,950      instructions              #    1.46  insn per cycle     
      
( +-  0.17% )
         6,197,972      branches                  #  878.884 M/sec              
      
( +-  0.15% )
            56,045      branch-misses             #    0.90% of all branches    
      
( +-  1.41% )

       0.272121270 seconds time elapsed                                         
 
( +- 52.41% )
~~~~~

> > I simply wonder whether anyone else is actually using lttng-ust, or if
> > they
> > are, whether they are only doing that for long-lived applications where
> > this overhead is not noticeable?
> 
> It appears that this application startup slowdown mainly appears with
> LTTng linked against liburcu 0.9 and 0.10 running with rather recent
> kernels. This would explain why you are the first one to report this across
> our user base.

Thank you, that explains it.

-- 
Milian Wolff | [email protected] | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts

smime.p7s
Description: S/MIME cryptographic signature

_______________________________________________
lttng-dev mailing list
[email protected]
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

Re: [lttng-dev] large liblttng-ust startup overhead (ust_lock)

Reply via email to