on 11/09/2013 04:26 Pqf 潘庆峰 said the following:
> I think your problem is find out how the zombies stay. Actually I can't tell
> base on the information you gave, but I think you can find out with these:
> 1. find out the pid of PM
> 2. use strace -p $PM_pid (linux) or truss -p $PM_pid(Solaris), it will tell
> you what PM doing, is the waitpid() called? is waitpid() return error? or the
> PM just die itself for some reasons? ...and other useful information.

Sorry that I was not clear about this in my original post. The PM is doing well: it's running and it's calling waitpid on other processes. It does not call waitpid on the zombie processes in question because they are still on the busy list, and it seems that the PM never checks processes on the busy list.

I've been thinking about this problem, and the only theory I have so far is that an owner httpd process could terminate ungracefully (e.g. crash). In that case the pool cleanup would never be run. That's OK for process-local resources like memory or file descriptors, which the OS frees anyway when the process dies. But it's not OK for external resources like other processes. In other words, if an httpd process marks an fcgid process as busy and then suddenly dies, there is nobody to move the fcgid process back to the idle list.

--
Andriy Gapon
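
P.S. To make the theory above concrete, here is a toy model in Python. It is only a sketch and not mod_fcgid code: the class, the method names and the pids are all made up for illustration, and the assumption that the PM scans only the idle list is just my reading of the observed behaviour.

```python
# Toy model of the suspected failure; none of these names come from mod_fcgid.

class ProcessTable:
    def __init__(self, pids):
        self.idle = set(pids)   # workers the process manager (PM) inspects
        self.busy = {}          # worker pid -> pid of the owning httpd process
        self.exited = set()     # workers that have exited but are not yet reaped
        self.reaped = set()     # workers the PM has "called waitpid()" on

    def acquire(self, owner_pid):
        """An httpd process marks an idle worker as busy."""
        worker = min(self.idle)
        self.idle.remove(worker)
        self.busy[worker] = owner_pid
        return worker

    def release(self, worker):
        """Pool cleanup: runs only if the owner terminates gracefully."""
        del self.busy[worker]
        self.idle.add(worker)

    def worker_exits(self, worker):
        """The worker process terminates and becomes a zombie until reaped."""
        self.exited.add(worker)

    def pm_scan(self):
        """Assumption: the PM reaps exited workers only from the idle list."""
        for worker in sorted(self.idle & self.exited):
            self.idle.remove(worker)
            self.exited.remove(worker)
            self.reaped.add(worker)

    def zombies(self):
        return sorted(self.exited)


def run(owner_crashes):
    table = ProcessTable([101, 102])
    worker = table.acquire(owner_pid=1)
    if not owner_crashes:
        table.release(worker)   # graceful exit: cleanup returns the worker
    # On a crash nothing calls release(), so the worker stays on the busy list.
    table.worker_exits(worker)
    table.pm_scan()
    return table.zombies()


print("graceful owner:", run(owner_crashes=False))
print("crashed owner: ", run(owner_crashes=True))
```

In the graceful case the worker goes back to the idle list and gets reaped. In the crash case it stays on the busy list forever, so the scan never sees it and it remains a zombie.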
