Hi Ryan,

On Tue, Mar 02, 2010 at 08:27:27PM -0800, Ryan Tomayko wrote:
>     collectd 4.8.1, http://collectd.org/

No relevant fix has gone into the Exec plugin since then, so the
problem should still exist in the master branch.

> Once I notice the plugin has stopped reporting, I have an extra
> process (28489) hanging around:
>     $ pstree -apu 22935
>     collectdmon,22935 -P /var/run/collectdmon.pid -- -C /etc/collectd/collectd.conf
>       collectd,22936 -C /etc/collectd/collectd.conf -f
>           collectd,28489 -C /etc/collectd/collectd.conf -f
>           {collectd},22937
>           {collectd},22938
>           {collectd},22939
>           {collectd},22940
>           {collectd},22941
>           {collectd},28487

> That process seems to exist only when the exec plugin is no longer
> reporting. Sometimes there's two of these processes.

This looks like the code that is supposed to spawn a new instance of the
script failed after fork(2) but before exec(2).

There are various cases in which exec(2) is not reached in
"exec_child()", but they all emit an error message. I take it there is
no error message anywhere in the logs or in syslog?
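
For illustration, here is a minimal sketch of the window between fork(2)
and exec(2) and of reporting a failure inside it. This is not the code of
"exec_child()"; the helper names "spawn_logged" and "wait_exit_status" and
the exit code 127 are assumptions made for this example only.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, then exec "file" in the child.
 * Returns the child's PID in the parent, or -1 if fork(2) failed. */
static pid_t spawn_logged(const char *file, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1; /* fork(2) itself failed, no child exists */

    if (pid == 0) {
        /* Child: everything from here to execvp() is the critical window. */
        execvp(file, argv);

        /* Only reached when exec failed. Use write(2), which is
         * async-signal-safe, instead of stdio. */
        static const char msg[] = "exec failed\n";
        if (write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0) {
            /* nothing more can be done here */
        }
        _exit(127); /* assumed convention for "could not execute" */
    }

    return pid; /* parent */
}

/* Hypothetical helper: wait for "pid" and return its exit status,
 * or -1 if it did not exit normally. */
static int wait_exit_status(pid_t pid)
{
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A child that gets stuck before execvp() keeps the parent's command line,
which would match the extra "collectd" entry in the pstree output above.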

> strace reports that the extra process is sitting in a mutex. It never
> leaves this state:
>     $ sudo strace -p 28489
>     Process 28489 attached - interrupt to quit
>     futex(0x7f2f7d4e8fb0, FUTEX_WAIT_PRIVATE, 2, NULL

There is a mutex in the exec plugin, but I doubt that this is the
problem. It is held just before a thread is spawned (to set a flag) and
just before that thread exits (to reset the flag). I don't see any way
this could lead to a deadlock or starvation.

I'm much more concerned about the SIGCHLD handler and the various
waitpid(2)s in the code. I could see the controlling thread missing its
child's signal and waiting forever. This shouldn't create weird new
processes though.
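
To make the concern concrete: several SIGCHLD signals can be merged into
one delivery, so a handler or thread that calls waitpid(2) once per signal
may miss a child. A common way around this is to reap in a loop with
WNOHANG. The sketch below shows that general technique, not what the Exec
plugin does; "reap_exited_children" is a hypothetical name.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Reap every child that has already exited, without blocking.
 * Returns the number of children reaped by this call. */
static int reap_exited_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {
            reaped++; /* one exited child collected, look for more */
            continue;
        }
        if (pid < 0 && errno == EINTR)
            continue; /* interrupted by a signal, retry */

        /* pid == 0: children exist but none has exited yet.
         * pid < 0 with ECHILD: there are no children left. */
        break;
    }

    return reaped;
}
```

Calling this after every SIGCHLD, or periodically, collects all exited
children no matter how many signals were actually delivered.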

> Any ideas what might be going on here or information I could provide
> to help find a root cause?

I have to admit I'm a bit puzzled by the described behavior. Maybe you
could provide the "lsof -p $PID" output for one of those weird child
processes?

While looking into this I did find a path in "exec_read_one()" where the
function returned without clearing the "PL_RUNNING" flag. I don't see
how this could produce a child process, but maybe it's worth a try. The
commit is 66c0d62 ([0]).
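
The class of bug described here (an early return that leaves a "running"
flag set) can be sketched as follows. This is not the code from the commit;
"demo_program_t", "DEMO_RUNNING" and "demo_read_one" are hypothetical
names, and the mutex that protects the real flag is left out for brevity.

```c
#define DEMO_RUNNING 0x01 /* hypothetical stand-in for the plugin's flag */

typedef struct {
    int flags;
} demo_program_t;

/* Every exit path goes through "cleanup", so the flag is always cleared.
 * "simulate_failure" stands in for any error that ends the read early. */
static int demo_read_one(demo_program_t *pl, int simulate_failure)
{
    int status = 0;

    pl->flags |= DEMO_RUNNING;

    if (simulate_failure) {
        status = -1;
        goto cleanup; /* a bare "return -1;" here would leave the flag set */
    }

    /* ... normal work would happen here ... */

cleanup:
    pl->flags &= ~DEMO_RUNNING;
    return status;
}
```

If the flag stays set, a later read that checks it would presumably skip
the program as "still running", which would look like the plugin silently
no longer reporting.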


Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
