> I presume that this is the reason why mod_fcgid does not use a SIGCHLD
> handler.

mod_fcgid has to work on both UNIX and Win32, so it picks a portable way to get this done on both platforms.
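To illustrate the polling approach (no SIGCHLD handler, just a periodic non-blocking waitpid() on every known child), here is a rough sketch. It is Python rather than mod_fcgid's C, it covers the Unix side only, and the name reap_finished is made up for the example; it is not the actual mod_fcgid code.

```python
import os
import time


def reap_finished(pids):
    """Poll each child once without blocking.

    Returns (still_running, reaped), where reaped maps pid -> raw wait status
    (or None if the pid was not a waitable child of this process).
    """
    still_running, reaped = [], {}
    for pid in pids:
        try:
            done_pid, status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            # Not our child, or already reaped elsewhere: nothing to wait for.
            reaped[pid] = None
            continue
        if done_pid == 0:
            still_running.append(pid)  # child has not terminated yet
        else:
            reaped[pid] = status       # zombie collected, status recorded
    return still_running, reaped
```

A manager process would call this from its periodic scan (every few seconds, say) with the pids it is responsible for. A child whose pid is never passed in is never reaped and stays a zombie after it exits.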
> Given the above, I can not find any holes in the mod_fcgid logic which could
> lead to unreaped zombies.

Yes, the logic is exactly what you said. I think your problem is to find out why the zombies stay around. I can't tell based on the information you gave, but I think you can find out like this:

1. Find the pid of the PM.
2. Run strace -p $PM_pid (Linux) or truss -p $PM_pid (Solaris). It will tell you what the PM is doing: is waitpid() called? Does waitpid() return an error? Or did the PM just die for some reason? ...and other useful information.

Good luck :)

> We've got a problem where sometimes some child processes of mod_fcgid PM
> terminate, but are never cleaned up by the PM.
>
> Below I first state my understanding about how mod_fcgid process management
> works and then I describe what I see and make some conclusions based on that.
>
> Could you please check my understanding for correctness?
> Perhaps you would have any suggestions on how to debug / workaround / fix the
> problem?
>
> An important note is that we still use apache 2.2.23 and mod_fcgid 2.3.5.
> Yeah, I know! We plan to upgrade really soon.
>
> Thank you very much!
>
> First a note that the safest way to take care of a process's children is to
> handle SIGCHLD. This way no child termination can be missed.
> The trade-off is that any program with non-default signal handlers becomes
> asynchronously concurrent and that imposes certain rules and limitations on
> the code.
> I presume that this is the reason why mod_fcgid does not use a SIGCHLD
> handler.
>
> My analysis of mod_fcgid's process management follows.
> mod_fcgid keeps three lists of child processes. The lists are kept in a
> special structure allocated in shared memory. Thus it can be inspected and
> modified by multiple processes.
> The lists are:
> - idle list is a list of processes that currently do not perform any work and
>   can be re-used for a new request processing
> - error list is a list of processes that had any communication problem (e.g.
>   an error writing to their socket or timeout waiting for a reply)
> - busy list is a list of processes that are performing any work (or at least
>   supposed to be)
>
> mod_fcgid code running in apache worker processes directly inspects the lists.
> The code picks up a process, if any is available, from the idle list and
> inserts it into the busy list.
> Communication to the process is done directly via a local socket.
> If there is no available process in the idle list, then the code sends a spawn
> request to a special dedicated mod_fcgid process (it appears as another apache
> process).
> The process is known as Process Manager (PM).
> The PM spawns a new process upon the request. Thus it is a parent process of
> all fcgid workers. The new process is inserted into the idle list if spawning
> is successful.
> The original apache process waits a little bit after issuing the spawn request
> and then re-examines the idle list.
> There is a hardcoded limit on a number of retries / iterations that can be
> done until the code gives up on the attempt to grab an idle process.
>
> The PM periodically (with configurable periods, default 3 seconds) walks the
> idle and the error lists and executes a non-blocking waitpid() call on every
> process in the lists.
> This way the PM detects the idle or "errored" processes that have terminated
> in any fashion.
> It must be noted that until the waitpid call the terminated processes are kept
> by Unix-like operating systems as "zombies".
> After waitpid call, which collects their termination information, the zombies
> are reaped.
>
> The PM never walks the busy list.
> A different mechanism is used for managing processes on the busy list.
> Apache has a concept of resource pools.
> For example, all memory allocations must refer to a pool.
> When the pool is cleared or destroyed all memory allocated from it is
> automatically cleaned up.
> Additionally, it is possible to register an arbitrary object and a cleanup
> callback with the pool.
> When the pool is cleared or destroyed all the registered callbacks are called
> upon their associated objects.
>
> To avoid any memory / resource leaks apache creates separate pools per each
> configured server, per each connection and per each request.
> All the code is supposed to use an appropriate pool based on the scope of its
> operation.
> When fcgid code grabs a process to handle a request and puts it on the busy
> list the code also registers a process handle and a special callback with a
> pool allocated for the request in question.
> The callback function moves the process from the busy list back to the idle
> list if there was no problems, or to the error list.
> Thus, if the apache server and the apache framework work as expected /
> documented, then the process should be "unbusied" as soon as the request is
> handled.
>
> Given the above, I can not find any holes in the mod_fcgid logic which could
> lead to unreaped zombies.
>
> On the affected system I observe that mod_fcgid reports the zombie processes
> as still working (being on the busy list).
> For example:
> $ sudo ps axwwl | fgrep -w Z
> 2084 67497 71375 0 20 0 0 0 - Z ?? 0:01.15 <defunct>
> 2125 82246 71375 0 20 0 0 0 - Z ?? 0:24.08 <defunct>
>
> Process name: php-fastcgi-wrapper
> Pid Active Idle Accesses State
> 67497 275184 275174 1 Working
> Process name: php-fastcgi-wrapper
> Pid Active Idle Accesses VirtualHost State
> 82246 335933 335672 119 Working
>
> So, this leads me to conclude that the problem lies somewhere in the apache
> server code or in the apache pool management code.
> Apparently the process cleanup callback has never been called for these
> processes and thus they are stuck on the busy list.
> Even more obvious is that the processes terminated in some fashion, most
> likely crashed.
> Possibly there is a correlation between these two observations, maybe some
> error conditions result in request cleanup not being properly done.
>
> --
> Andriy Gapon
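
To double-check that the defunct processes really are children of the PM, the ps output quoted above can be grouped by parent pid. Below is a rough sketch in Python. The column layout (UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND) is my assumption about what `ps axwwl` prints on that system, and the function name zombies_by_parent is made up for the example.

```python
from collections import defaultdict

# Assumed column positions in `ps axwwl` output:
# UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
PID_COL, PPID_COL, STAT_COL = 1, 2, 9


def zombies_by_parent(ps_output):
    """Map parent pid -> list of zombie child pids found in the ps text."""
    result = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) <= STAT_COL or not fields[PID_COL].isdigit():
            continue  # skip the header line and anything malformed
        if fields[STAT_COL].startswith("Z"):
            result[int(fields[PPID_COL])].append(int(fields[PID_COL]))
    return dict(result)


# The two lines from the report quoted above.
sample = """\
2084 67497 71375 0 20 0 0 0 - Z ?? 0:01.15 <defunct>
2125 82246 71375 0 20 0 0 0 - Z ?? 0:24.08 <defunct>
"""
print(zombies_by_parent(sample))  # {71375: [67497, 82246]}
```

If 71375 turns out to be the pid of the PM, then both zombies are its direct children and only the PM can reap them, which is why tracing the PM is the place to start.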
