> -----Original Message----- > From: Mathieu Desnoyers [mailto:mathieu.desnoy...@efficios.com] > Sent: Wednesday, March 14, 2018 6:25 PM > To: THEUNISSEN Rolf <rolf.theunis...@altran.com> > Cc: lttng-dev <lttng-dev@lists.lttng.org> > Subject: Re: [lttng-dev] [lttng-ust] Missing events when just before a process > exits > > ----- On Mar 14, 2018, at 12:37 PM, THEUNISSEN Rolf > rolf.theunis...@altran.com wrote: > > >> -----Original Message----- > >> From: Jonathan Rajotte-Julien > >> [mailto:jonathan.rajotte-jul...@efficios.com] > >> Sent: Tuesday, March 13, 2018 4:35 PM > >> To: THEUNISSEN Rolf <rolf.theunis...@altran.com> > >> Cc: lttng-dev@lists.lttng.org > >> Subject: Re: [lttng-dev] [lttng-ust] Missing events when just before > >> a process exits > >> > >> Hi Rolf, > >> > >> On Tue, Mar 13, 2018 at 10:48:17AM +0000, THEUNISSEN Rolf wrote: > >> > Hi, > >> > > >> > I am currently tracing many processes with LTTng-UST on a system > >> > under > >> heavy load, with 2 CPUs. > >> > > >> > The traces seem to be missing ust-events near the end of a trace. > >> > After > >> doing some analysis, it seems that events are missing from the other > >> CPU then the process exits on. More concretely: > >> > > >> > During the lifetime of a process: > >> > - Process X executes on CPU 0,1: all events are in the trace Just > >> > before the process is about to exit: > >> > - Process X executes on CPU 0: missing events from the trace > >> > >> How are you validating this? > > > > I am also using kernel tracing, to trace the scheduler. In > > TraceCompass I can view the states of the processes and CPUs. > > > >> > >> > A few mili-seconds later: > >> > - Process X executes on CPU1 and exits: The last events are in the > >> > trace > >> > >> This is not expected. Do you have a reproducer so we can check what > >> might be happening? > > > > I am not allowed to provide details about my current traces. And it is > > hard to make a small example, but I got some more clues what is > happening, see below. > > > >> > >> We will need the version for lttng-ust, lttng-tools, babeltrace. > > > > I am using version 2.10, for analyzing the traces I am using > > TraceCompass Kernel version: 3.10.62 > > > >> > >> Note that you can run your application with lttng-ust in debug mode > >> using LTTNG_UST_DEBUG=y. Debug statements will be outputted on > stderr. > >> > >> e.g: > >> $ LTTNG_UST_DEBUG=y ./my_sample_app > >> > > > > Enableling this debugging flag gave me a clue what is happening. For > > all processes, the registration is going OK. But for some processes, > > the un-registration is not logged. On closer analysis (now with > > tracepoints from SystemTap), I see that the processes for which the > > events are missing and the un-registration is missing are killed by > > SIGKILL signal. This probably explains why some data can be lost. But > > maybe strange that only messages get lost that were traced on a different > CPU. > > > > I remain puzzled why the SIGKILL arrives at the process, as a SIGTERM > > is sent to the process. Somehow the kernel decides to deliver an > > SIGKILL instead of a SIGTERM. > > When using lttng-ust with per-uid buffers, there is a known design limitation > regarding process terminated by non-handled kill signals: if this occurs while > the buffer is being written to (between reserve and commit), it will cause the > rest of the per-cpu buffer to be unreadable. This applies to all events > recorded by all applications of the same user id for that particular CPU. > Recovering from this involves destroying and re-creating the tracing session. > > If this kind of scenario is likely for you, you might want to consider using > per- > pid buffers (lttng enable-channel -u --buffers-pid), which do not suffer from > this design limitation. The downside of per-pid buffers is that it allocates > more memory on your system for buffering, and adds extra overhead when > used with processes that have a short life-time. > > Thanks, > > Mathieu >
Thanks for your reply, indeed I was using --buffers-uid when I first observed this issue. Then I switched back to --buffers-pid, as I suspected it to be more robust, however the issue did not resolve (though it might occur less often). Anyways, it is expected to lose data when a process is SIGKILLed. It might be worth mentioning above design restriction in the man pages of lttng. Thanks for your time. Kind Regards, Rolf > > > > > Rolf > > > >> Cheers > >> > >> -- > >> Jonathan Rajotte-Julien > >> EfficiOS > > _______________________________________________ > > lttng-dev mailing list > > lttng-dev@lists.lttng.org > > https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.lttng.org_c > > gi-2Dbin_mailman_listinfo_lttng- > 2Ddev&d=DwICaQ&c=cxWN2QSDopt5SklNfbjIj > > > g&r=XBg4LogZ8Z2xGquh4s7VDGwsgBcwKOdV99wapS1T0K8&m=1VAbXAhF9 > 0qZpqUripzg > > XtT8byswpHwO42jCevAQHQI&s=Ev3_wi-aSOTDFTiHQU6sTXBFFbZbY2kKV- > up4DWj2ts& > > e= > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://urldefense.proofpoint.com/v2/url?u=http- > 3A__www.efficios.com&d=DwICaQ&c=cxWN2QSDopt5SklNfbjIjg&r=XBg4Lo > gZ8Z2xGquh4s7VDGwsgBcwKOdV99wapS1T0K8&m=1VAbXAhF90qZpqUripzg > XtT8byswpHwO42jCevAQHQI&s=-AbmrFCL-5CGzKx2z1- > cRYdbOEWaxQtOAhoyNHtUojg&e= _______________________________________________ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev