On Thu, 23 May 2002, Bob Crandell wrote:
>I have a server with a growing number of these:
>20344 ? Z 0:00 [sh <defunct>]
>20354 ? Z 0:00 [sh <defunct>]
>20355 ? Z 0:00 [sh <defunct>]
>20363 ? Z 0:00 [sh <defunct>]
>
>How do I get rid of them? How do I find out what's causing them?
What you have there are "zombie processes"--the corpses of dead processes
that haven't been buried yet.
When a UNIX process exits, the kernel unloads its code and data, but keeps
the process data structures in memory, so that its parent process can
collect statistics about its behavior (the "time" command, for example,
uses this information to tell you how much CPU time a program used). When
the parent process makes the "wait()" system call (or one of its
relatives), the status of the exited child is returned to it, and the
kernel releases the process data structures.
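
To make the wait() handshake concrete, here is a minimal Python sketch of
a parent collecting a child's exit status and CPU-time statistics, much as
"time" does. The function name run_and_collect and the exit code 7 are
made up for illustration. os.wait4() is Python's wrapper for one of the
wait() relatives.

```python
import os

def run_and_collect(exit_code=7):
    """Fork a child, let it exit, then collect what the kernel kept for us."""
    pid = os.fork()
    if pid == 0:
        # Child: burn a little CPU, then exit with a known status.
        sum(i * i for i in range(200_000))
        os._exit(exit_code)
    # Parent: wait4() blocks until the child has exited, then returns its
    # exit status and resource usage. After this call the kernel releases
    # the child's process data structures.
    _, status, usage = os.wait4(pid, 0)
    return os.WEXITSTATUS(status), usage.ru_utime, usage.ru_stime
```

Calling run_and_collect() returns the child's exit code together with its
user and system CPU time in seconds.
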
Your symptoms happen when a parent process fails to wait() for its
children. In that case, each child's kernel data structures remain in
memory, and the child is marked "Z" in the ps list. Even "kill -9" doesn't
get rid of a zombie, since it's already dead.
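
If you want to watch this happen, the following rough sketch creates a
zombie on purpose, sends it the equivalent of "kill -9", and only then
waits for it. It assumes a Linux system with /proc mounted, since it reads
the state letter from /proc/<pid>/stat instead of calling ps. The names
proc_state and zombie_demo are made up for this example.

```python
import os
import signal
import time

def proc_state(pid):
    """Return the state letter from /proc/<pid>/stat, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]

def zombie_demo():
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately
    # Parent: deliberately do NOT wait() yet, so the child becomes a zombie.
    deadline = time.time() + 5
    while proc_state(pid) != "Z" and time.time() < deadline:
        time.sleep(0.05)
    before_kill = proc_state(pid)
    os.kill(pid, signal.SIGKILL)  # same signal as "kill -9"
    time.sleep(0.1)
    after_kill = proc_state(pid)
    os.waitpid(pid, 0)  # the parent finally reaps the child
    after_wait = proc_state(pid)
    return before_kill, after_kill, after_wait
```

On a Linux box I would expect zombie_demo() to report "Z" both before and
after the kill, and None (no such process) once the parent has waited.
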
What you need to do is find out the identity of the parent process. The
command "ps alx" will show you all your processes, with the process ID of
the parent listed in the "PPID" column. Then do "ps <PPID>", where
"<PPID>" is the process id of the parent that you just found.
Now you know the identity of the culprit. Figuring out what to do about it
is another story. Depending on the process, there might be something you
can do to kick it and make it realize it has work to do. If not, stopping
and restarting the process should get rid of its zombie children: once the
parent exits, init (process 1) inherits the zombies and wait()s for them.
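
With many zombies, reading the PPID column by eye gets tedious. As a rough
sketch, the Python below groups zombie PIDs by parent so you can see which
process to kick or restart. It does not use "ps alx". It assumes a ps that
accepts "-A" and repeated "-o" options with the pid, ppid, stat and comm
columns, which is an assumption about your system. The names
zombies_by_parent and read_ps are made up for this example.

```python
import subprocess
from collections import defaultdict

def zombies_by_parent(ps_output):
    """Map parent PID -> list of zombie child PIDs.

    Expects one process per line in the form: pid ppid stat command
    """
    groups = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        pid, ppid, stat = fields[0], fields[1], fields[2]
        if stat.startswith("Z"):
            groups[int(ppid)].append(int(pid))
    return dict(groups)

def read_ps():
    """Run ps without headers (assumed flags; adjust for your system)."""
    cmd = ["ps", "-A", "-o", "pid=", "-o", "ppid=",
           "-o", "stat=", "-o", "comm="]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout
```

zombies_by_parent(read_ps()) returns a dictionary keyed by parent PID. Each
key is a process to look up with "ps <PPID>".
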
Sometimes you may find that the process ID of the errant parent is 1.
Process ID 1 is always "init", the ancestor of all processes and the most
important process on the system, so you can't just stop and restart it.
The command "/sbin/telinit q" (which asks init to re-read its
configuration) may help. If it doesn't, you may have to reboot the system
to get rid of the zombies.
- Neil Parker, [EMAIL PROTECTED]