(Thanks for the tip that helped lead to the dlclose() solution.)

Now I'm running into an issue where the exec plugin's child process hangs in a call to getpwnam_r() from inside exec_child(). The hang happens reliably if I use upstart to start collectdmon, which in turn starts collectd. If I start collectdmon manually, everything works the way it is supposed to. (This hang happened even before I added the dlclose() patch, and it happens earlier in exec_child() than anything done by the dlclose() patch.)

This is with collectd 5.2.0 on Ubuntu 12.04 LTS. The 'top' command reports 32 logical CPUs. I have only one item that the exec plugin should be calling. If I gather correctly, my getpwnam_r() comes from glibc, not from the ifdef'ed function in common.c.

It appears this issue may be related to the following issues:

    http://mailman.verplant.org/pipermail/collectd/2010-March/003650.html
    https://github.com/collectd/collectd/issues/229
    https://gist.github.com/jessereynolds/2878994
    http://www.mail-archive.com/[email protected]/msg00524.html

and possibly

    http://monkey.org/freebsd/archive/freebsd-threads/200307/msg00110.html

Here's the output of strace on a hung child process:

    Process 2435 attached - interrupt to quit
    futex(0x7fd431c3cdb0, FUTEX_WAIT_PRIVATE, 2, NULL

Here's the output of the gdb 'where' command on a hung child process:

    (gdb) where
    #0  0x00007fcf120d09bb in ?? () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x00007fcf120d591c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
    #2  0x00007fcf120d4f2b in __nss_database_lookup () from /lib/x86_64-linux-gnu/libc.so.6
    #3  0x00007fcf120d630c in __nss_passwd_lookup2 () from /lib/x86_64-linux-gnu/libc.so.6
    #4  0x00007fcf1208dac8 in getpwnam_r () from /lib/x86_64-linux-gnu/libc.so.6
    #5  0x00007fcf0c6e38ed in exec_child (pl=0x18e5e90) at exec.c:303
    #6  fork_child (pl=0x18e5e90, fd_in=<optimized out>, fd_out=<optimized out>, fd_err=0x7fcf09aca5e8) at exec.c:509
    #7  0x00007fcf0c6e3f5e in exec_read_one (arg=0x18e5e90) at exec.c:560
    #8  0x00007fcf12599e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
    #9  0x00007fcf120c2cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
    #10 0x0000000000000000 in ?? ()

(I could compile with --enable-debug, but that's a significant effort and it doesn't appear it would add much info at this point.)

Here's /proc/$pid/status for a hung child process:

    Name:   collectd
    State:  S (sleeping)
    Tgid:   128026
    Pid:    128026
    PPid:   128018
    TracerPid:      0
    Uid:    0       0       0       0
    Gid:    0       0       0       0
    FDSize: 64
    Groups:
    VmPeak:   624472 kB
    VmSize:   624472 kB
    VmLck:         0 kB
    VmPin:         0 kB
    VmHWM:      1484 kB
    VmRSS:      1484 kB
    VmData:   516584 kB
    VmStk:       136 kB
    VmExe:       164 kB
    VmLib:     10916 kB
    VmPTE:       268 kB
    VmSwap:        0 kB
    Threads:        1
    SigQ:   5/256851
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000000
    SigBlk: 0000000000000000
    SigIgn: 0000000000001000
    SigCgt: 0000000180014202
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
    Cpus_allowed_list:      0-127
    Mems_allowed:   00000000,00000003
    Mems_allowed_list:      0-1
    voluntary_ctxt_switches:        9
    nonvoluntary_ctxt_switches:     0

Some documents on the web said getpwnam_r() could hang if using NIS or LDAP. Here's my /etc/nsswitch.conf:

    # /etc/nsswitch.conf
    #
    # Example configuration of GNU Name Service Switch functionality.
    # If you have the `glibc-doc-reference' and `info' packages installed, try:
    # `info libc "Name Service Switch"' for information about this file.
    passwd:         compat
    group:          compat
    shadow:         compat
    hosts:          files dns
    networks:       files
    protocols:      db files
    services:       db files
    ethers:         db files
    rpc:            db files
    netgroup:       nis

If I gather correctly, 'compat' for the passwd, group, and shadow entries should be equivalent to 'files' if there are no exception items. The 'netgroup: nis' entry is apparently from the stock Ubuntu installation. As far as I am aware, there is no NIS active for this machine.

Is anyone here acquainted with reasons getpwnam_r() might hang, and/or with a better solution than adding the 'sleep(1)' mentioned in one of the reference URLs? (A sleep likely wouldn't help in my case, because I only have one thing called by the exec plugin.)

Thanks,

Robert Riches
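
P.S. A possible alternative to the sleep(1) workaround, sketched under assumptions. The backtrace shows getpwnam_r() being called after fork(), from a thread of a multithreaded process, and the child is blocked on a futex with only one thread left. One reading (an assumption, not a confirmed diagnosis) is that some other thread held a lock inside glibc's NSS code at the moment of the fork, so the child inherited a lock that nobody will ever release. If that is what is happening, doing the user lookup in the parent before fork() would avoid it. The helper names lookup_user() and spawn_as() below are hypothetical and are not taken from exec.c:

```c
#include <assert.h>
#include <errno.h>
#include <pwd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from collectd): resolve a user name to a
 * uid/gid pair.  Intended to run in the parent, BEFORE fork(), so that
 * the child never has to enter the NSS code.
 * Returns 0 on success, an errno-style value otherwise. */
int lookup_user(const char *name, uid_t *uid, gid_t *gid)
{
    struct passwd pw;
    struct passwd *result = NULL;
    char buf[16384]; /* assumed large enough; ERANGE is returned if not */
    int status;

    assert(name != NULL && uid != NULL && gid != NULL);

    status = getpwnam_r(name, &pw, buf, sizeof(buf), &result);
    if (status != 0)
        return status;  /* lookup failed */
    if (result == NULL)
        return ENOENT;  /* no such user */

    *uid = pw.pw_uid;
    *gid = pw.pw_gid;
    return 0;
}

/* Hypothetical helper: fork, drop to uid/gid, and exec argv[0].
 * Between fork() and execv() the child makes only direct system-call
 * wrappers (setgid, setuid, execv, _exit) and no name lookups.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
pid_t spawn_as(uid_t uid, gid_t gid, char *const argv[])
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;     /* parent, or fork() error */

    /* Child: set the group first, while we may still have privileges. */
    if (setgid(gid) != 0 || setuid(uid) != 0)
        _exit(126);
    execv(argv[0], argv);
    _exit(127);         /* only reached if execv() failed */
}
```

The calling thread would run lookup_user() first and pass the resulting uid and gid to spawn_as(), then collect the child with waitpid() as usual. Whether this fits the way exec.c sets up its pipes and supplementary groups is something I have not checked.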
