----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoy...@efficios.com>
> To: "David OShea" <david.os...@quantum.com>
> Cc: "lttng-dev" <lttng-dev@lists.lttng.org>
> Sent: Monday, January 12, 2015 10:34:37 AM
> Subject: Re: [lttng-dev] Segfault at v_read() called from
>          lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
>          - CPU/VMware dependent
>
> ----- Original Message -----
> > From: "David OShea" <david.os...@quantum.com>
> > To: "Mathieu Desnoyers" <mathieu.desnoy...@efficios.com>
> > Cc: "lttng-dev" <lttng-dev@lists.lttng.org>
> > Sent: Monday, January 12, 2015 1:33:07 AM
> > Subject: RE: [lttng-dev] Segfault at v_read() called from
> >          lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> >          - CPU/VMware dependent
> >
> > Hi Mathieu,
> >
> > Apologies for the delay in getting back to you, please see below:
> >
> > > -----Original Message-----
> > > From: Mathieu Desnoyers [mailto:mathieu.desnoy...@efficios.com]
> > > Sent: Friday, 12 December 2014 2:07 AM
> > > To: David OShea
> > > Cc: lttng-dev
> > > Subject: Re: [lttng-dev] Segfault at v_read() called from
> > >          lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> > >          - CPU/VMware dependent
> > >
> > > ________________________________
> > >
> > > From: "David OShea" <david.os...@quantum.com>
> > > To: "lttng-dev" <lttng-dev@lists.lttng.org>
> > > Sent: Sunday, December 7, 2014 10:30:04 PM
> > > Subject: [lttng-dev] Segfault at v_read() called from
> > >          lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> > >          - CPU/VMware dependent
> > >
> > > Hi all,
> > >
> > > We have encountered a problem with using LTTng-UST tracing with our
> > > application, where on a particular VMware vCenter cluster we almost
> > > always get segfaults when tracepoints are enabled, whereas on another
> > > vCenter cluster, and on every other machine we’ve ever used, we don’t
> > > hit this problem.
> > >
> > > I can reproduce this using lttng-ust/tests/hello after using:
> > >
> > > """
> > > lttng create
> > > lttng enable-channel channel0 --userspace
> > > lttng add-context --userspace -t vpid -t vtid -t procname
> > > lttng enable-event --userspace "ust_tests_hello:*" -c channel0
> > > lttng start
> > > """
> > >
> > > In which case I get the following stack trace with an obvious NULL
> > > pointer dereference:
> > >
> > > """
> > > Program terminated with signal SIGSEGV, Segmentation fault.
> > > #0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > > 48          return uatomic_read(&v_a->a);
> > > [...]
> > > #0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > > #1  0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (
> > >     buf=0x7f4a98008a00, chan=0x7f4a98008a00, offsets=0x7fffef67c620,
> > >     ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
> > > #2  0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow (ctx=0x7fffef67ca40)
> > >     at ring_buffer_frontend.c:1819
> > > #3  0x00007f4aa1095b75 in lib_ring_buffer_reserve (ctx=0x7fffef67ca40,
> > >     config=0x7f4aa12b8ae0 <client_config>)
> > >     at ../libringbuffer/frontend_api.h:211
> > > #4  lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)
> > >     at lttng-ring-buffer-client.h:473
> > > #5  0x000000000040135f in __event_probe__ust_tests_hello___tptest (
> > >     __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,
> > >     text=0x7fffef67cb70 "test", textlen=<optimized out>, doublearg=2,
> > >     floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
> > > #6  0x0000000000400d2c in __tracepoint_cb_ust_tests_hello___tptest (
> > >     boolarg=true, floatarg=2222, doublearg=2, textlen=4,
> > >     text=0x7fffef67cb70 "test", values=0x7fffef67cb50,
> > >     netint=<optimized out>, anint=0) at ust_tests_hello.h:32
> > > #7  main (argc=<optimized out>, argv=<optimized out>) at hello.c:92
> > > """
> > >
> > > I hit this segfault 10 out of 10 times I ran “hello” on a VM on one
> > > vCenter and 0 out of 10 times I ran it on the other, and the VMs
> > > otherwise had the same software installed on them:
> > >
> > > - CentOS 6-based
> > > - kernel-2.6.32-504.1.3.el6 with some minor changes made in networking
> > > - userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2 which
> > >   might have some minor patches backported, and leftovers of changes
> > >   to get them to build on CentOS 5
> > >
> > > On the “good” vCenter, I tested on two different VM hosts:
> > >
> > > Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> > > EVC Mode: Intel(R) "Nehalem" Generation
> > > Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > >
> > > Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
> > > EVC Mode: Intel(R) "Nehalem" Generation
> > > Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > >
> > > The “bad” vCenter VM host that I tested on had this configuration:
> > >
> > > ESX Version: VMware ESXi, 5.0.0, 469512
> > > Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
> > >
> > > Any ideas?
> > >
> > > My bet would be that the OS is lying to userspace about the number of
> > > possible CPUs. I wonder what liblttng-ust libringbuffer/shm.h
> > > num_possible_cpus() is returning compared to what
> > > lib_ring_buffer_get_cpu() returns.
> > >
> > > Can you check this out ?
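A minimal sketch of the comparison Mathieu is asking for, added here for illustration only (it is not part of the original thread). It assumes that num_possible_cpus() in libringbuffer/shm.h is derived from sysconf(_SC_NPROCESSORS_CONF), and that lib_ring_buffer_get_cpu() ultimately reports the same CPU number as sched_getcpu(); the file name below is hypothetical.

"""
/* cpu_report.c - one-shot report of the two values being compared
 * (hypothetical helper, not part of lttng-ust).
 * Build: gcc -o cpu_report cpu_report.c
 */
#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu() */
#include <stdio.h>
#include <unistd.h>     /* sysconf() */

int main(void)
{
        /* Configured ("possible") CPUs, as reported by glibc. */
        long conf = sysconf(_SC_NPROCESSORS_CONF);
        /* CPUs currently online. */
        long onln = sysconf(_SC_NPROCESSORS_ONLN);
        /* CPU number this thread is running on right now. */
        int cur = sched_getcpu();

        printf("_SC_NPROCESSORS_CONF = %ld\n", conf);
        printf("_SC_NPROCESSORS_ONLN = %ld\n", onln);
        printf("sched_getcpu()       = %d\n", cur);
        return 0;
}
"""

Under the hypothesis discussed in this thread, a crash of the kind shown above would be expected whenever sched_getcpu() returns a value greater than or equal to _SC_NPROCESSORS_CONF.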
> >
> > Yes, this seems to be the case - 'gdb' on the core dump shows:
> >
> > (gdb) p __num_possible_cpus
> > $1 = 2
> >
> > which is consistent with how I configured the virtual machine, and with
> > this output:
> >
> > # lscpu
> > Architecture:          x86_64
> > CPU op-mode(s):        32-bit, 64-bit
> > Byte Order:            Little Endian
> > CPU(s):                2
> > On-line CPU(s) list:   0,1
> > Thread(s) per core:    1
> > Core(s) per socket:    1
> > Socket(s):             2
> > NUMA node(s):          1
> > Vendor ID:             GenuineIntel
> > CPU family:            6
> > Model:                 26
> > Stepping:              4
> > CPU MHz:               1995.000
> > BogoMIPS:              3990.00
> > Hypervisor vendor:     VMware
> > Virtualization type:   full
> > L1d cache:             32K
> > L1i cache:             32K
> > L2 cache:              256K
> > L3 cache:              18432K
> > NUMA node0 CPU(s):     0,1
> >
> > Despite the fact that there are 2 CPUs, when I hacked
> > lttng-ring-buffer-client.h to output the result of lib_ring_buffer_get_cpu()
> > and then ran tests/hello with tracing enabled, I could see it would sit on
> > CPU 0 for a while, or CPU 1, and perhaps move between the two, but
> > eventually either 2 or 3 would appear, immediately followed by the
> > segfault.
> >
> > The VM host has 4 sockets, 8 cores per socket, with Hyper-Threading
> > enabled. The VM has its "HT Sharing" option set to "Any", which according to
> > https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc_50%2FGUID-101176D4-9866-420D-AB4F-6374025CABDA.html
> > means that each one of the virtual machine's virtual cores can share a
> > physical core with another virtual machine, each virtual core using a
> > different thread on that physical core. I assume none of this should be
> > relevant except perhaps if there are bugs in VMware.
> >
> > Is it possible that this is an issue in LTTng, or should I work out how
> > the kernel determines which CPU it is running on and then look into
> > whether there are any VMware bugs in this area?
>
> This appears to be very likely a VMware bug. /proc/cpuinfo should show
> 4 CPUs (and sysconf(_SC_NPROCESSORS_CONF) should return 4) if the current
> CPU number can be 0, 1, 2, 3 throughout execution.
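That claim can be checked without the tracer. The sketch below is illustrative only (not from the original thread): it samples sched_getcpu() in a loop and reports any CPU number that falls outside the range implied by sysconf(_SC_NPROCESSORS_CONF). On the assumption that liblttng-ust obtains the current CPU number the same way, such a value is what would send lib_ring_buffer_try_reserve_slow() past the end of the per-CPU buffer array.

"""
/* cpu_watch.c - hypothetical reproducer: sample the current CPU number
 * for a while and complain if it ever reaches or exceeds the configured
 * CPU count.  Build: gcc -O2 -o cpu_watch cpu_watch.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        long possible = sysconf(_SC_NPROCESSORS_CONF);
        time_t deadline = time(NULL) + 60;      /* sample for ~1 minute */

        printf("configured CPUs: %ld\n", possible);
        while (time(NULL) < deadline) {
                int cpu = sched_getcpu();

                if (cpu < 0) {
                        perror("sched_getcpu");
                        return 1;
                }
                if (cpu >= possible) {
                        printf("out-of-range CPU number %d (>= %ld)\n",
                               cpu, possible);
                        return 1;
                }
        }
        printf("no out-of-range CPU numbers seen\n");
        return 0;
}
"""

Running it while the VM is busy enough to be moved across host threads should show whether the out-of-range CPU numbers appear outside of LTTng as well.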
You might want to look at the sysconf(3) manpage, especially the parts about
_SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN. My guess is that vmware is
lying about the number of "possible" CPUs (_SC_NPROCESSORS_CONF).

Thanks,

Mathieu

> Thanks,
>
> Mathieu
>
> > Thanks in advance,
> > David
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev