Re: Killing a zombie process?

2015-10-02 Thread Paul Goyette

On Fri, 2 Oct 2015, Paul Goyette wrote:


For now, I took a quick look into the zombie's struct proc.

p_exitsig = 0x14   = SIGCHLD
p_flag    = 0x0
p_sflag   = 0x2000 = PS_WEXIT
p_slflag  = 0x0
p_lflag   = 0x2    = PL_CONTROLT
p_stflag  = 0x0
p_stat    = 0x5    = SZOMB

p_trace_enabled = 0x0
p_pid           = 0x5280 = 21120 (the same value shown by ps)

I don't see anything unusual here.
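
For anyone who wants to repeat this check on another dump, here is a
minimal sketch of the decoding done by hand above.  The constants are
simply the values quoted in the listing, not copied from the NetBSD
headers, and struct proc_fields is a hypothetical container holding
only the fields looked at here.

```c
#include <stdio.h>

/* Values as quoted in the listing above (assumed, not taken from headers). */
#define EXITSIG_SIGCHLD 20      /* p_exitsig 0x14 */
#define PS_WEXIT        0x2000  /* p_sflag bit */
#define PL_CONTROLT     0x2     /* p_lflag bit */
#define SZOMB           5       /* p_stat value */

/* Hypothetical container for just the fields inspected by hand. */
struct proc_fields {
    int      exitsig;
    unsigned sflag;
    unsigned lflag;
    int      stat;
    int      pid;
};

/* Return 1 if the fields describe an ordinary exited (zombie) process. */
int looks_like_plain_zombie(const struct proc_fields *p)
{
    return p->stat == SZOMB
        && (p->sflag & PS_WEXIT) != 0
        && p->exitsig == EXITSIG_SIGCHLD;
}

/* Print the decoded flags in roughly the same shape as the listing. */
void describe(const struct proc_fields *p)
{
    printf("pid %d: stat=%d%s sflag=0x%x%s lflag=0x%x%s\n",
        p->pid,
        p->stat,  p->stat == SZOMB ? " (SZOMB)" : "",
        p->sflag, (p->sflag & PS_WEXIT) ? " (PS_WEXIT)" : "",
        p->lflag, (p->lflag & PL_CONTROLT) ? " (PL_CONTROLT)" : "");
}
```

Filling a struct proc_fields with { 0x14, 0x2000, 0x2, 0x5, 0x5280 }
and passing it to both functions reproduces the reading above: a plain
zombie with nothing odd in its flags.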

I have attached the hex-dump in case anyone wants to look a little bit 
closer.


OK, I forced a system crash (using ddb's sync command), and here's what 
gdb says about the zombie's struct proc (with manually inserted line 
breaks for readability and some flag value annotations):


(gdb) print (struct proc *) 0xfe81f578ba70
$1 = (struct proc *) 0xfe81f578ba70
(gdb) print *(struct proc *) 0xfe81f578ba70
$2 = {
  p_list = {le_next = 0x0, le_prev = 0x806be700 },
  p_auxlock = {u = {mtxa_owner = 0}},
  p_lock = 0xfe81fbb7a840,
  p_stmutex = {u = {mtxa_owner = 2049}},
  p_reflock = {rw_owner = 0},
  p_waitcv = {cv_opaque = {0x0, 0xfe81f578baa0, 0x804d542e}},
  p_lwpcv = {cv_opaque = {0x0, 0xfe81f578bab8, 0x804e7f9a}},
  p_cred = 0xfe81ef0106c0,
  p_fd = 0xfe810f46f680,
  p_cwdi = 0x0,
  p_stats = 0xfe81e00b5700,
  p_limit = 0xfe8155fe8de8,
  p_vmspace = 0x80722de0 ,
  p_sigacts = 0xfe803be9b258,
  p_aio = 0x0,
  p_mqueue_cnt = 0,
  p_specdataref = {
    specdataref_container = 0x0,
    specdataref_lock = {u = {mtxa_owner = 18446744073709551600}}},
  p_exitsig = 20,
  p_flag = 0,
  p_sflag = 8192 (PS_WEXIT),
  p_slflag = 0,
  p_lflag = 2 (PL_CONTROLT),
  p_stflag = 0,
  p_stat = 5 '\005' (SZOMB),
  p_trace_enabled = 0 '\000',
  p_pad1 = "\203",
  p_pid = 21120,
  p_pglist = {
    le_next = 0x0,
    le_prev = 0xfe81eab655b0},
  p_pptr = 0xfe810f45ecd0,
  p_sibling = {
    le_next = 0xfe81f7618d20, le_prev = 0xfe81fc805108},
  p_children = {lh_first = 0x0},
  p_lwps = {lh_first = 0xfe8021ccb560},
  p_raslist = 0x0,
  p_nlwps = 1,
  p_nzlwps = 1,
  p_nrlwps = 0,
  p_nlwpwait = 0,
  p_ndlwps = 0,
  p_nlwpid = 1,
  p_nstopchild = 0,
  p_waited = 0,
  p_zomblwp = 0x0,
  p_vforklwp = 0x0,
  p_sched_info = 0x0,
  p_estcpu = 0,
  p_estcpu_inherited = 36864,
  p_forktime = 17842,
  p_pctcpu = 0,
  p_opptr = 0x0,
  p_timers = 0x0,
  p_rtime = {sec = 0, frac = 0},
  p_uticks = 0,
  p_sticks = 0,
  p_iticks = 0,
  p_traceflag = 0,
  p_tracep = 0x0,
  p_textvp = 0xfe81e6023190,
  p_emul = 0x806b6300 ,
  p_emuldata = 0x0,
  p_execsw = 0x808be0e0,
  p_klist = { slh_first = 0x0},
  p_sigwaiters = {lh_first = 0x0},
  p_sigpend = {
    sp_info = {tqh_first = 0x0, tqh_last = 0xfe81f578bc48},
    sp_set = {__bits = {0, 0, 0, 0}}},
  p_lwpctl = 0x0,
  p_ppid = 1,
  p_fpid = 0,
  p_sigctx = {
    ps_signo = 0, ps_code = 0, ps_lwp = 0, ps_sigcode = 0x0,
    ps_sigignore = {__bits = {4294967295, 4294967295, 4294967295, 4294967295}},
    ps_sigcatch = {__bits = {0, 0, 0, 0}}},
  p_nice = 20 '\024',
  p_comm = "sh\000ke", '\000' ,
  p_pgrp = 0xfe81eab655b0,
  p_psstrp = 140187732541408,
  p_pax = 0,
  p_xstat = 0,
  p_acflag = 1,
  p_md = {md_flags = 0, md_syscall = 0x8012f010 },
  p_stackbase = 140187732541440,
  p_dtrace = 0x7f7ff683b8e6}

As far as I can tell, everything looks normal.  Yet the process never 
gets reaped by init.


The one thing that surprises me here is that the zombie still has a 
non-NULL p_textvp, which would point to /bin/sh _within_ the chroot() 
sandbox (consistent with the p_comm = "sh" entry).  I'm guessing that 
this reference is what's preventing me from unmounting this nullfs 
mount.  (I previously expected the inability to unmount to be the result 
of a reference from the zombie's cwd.)



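+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+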


Re: Killing a zombie process?

2015-10-02 Thread Paul Goyette

Still investigating, but I think I may have found something...

Using the p_pptr value 0xfe810f45ecd0 from the zombie's struct proc, 
I examined the struct proc for init.  I followed the code in the 
find_stopped_child() routine in src/sys/kern/kern_exit.c, and walked 
through the loop for each of init's children.  The first several 
processes are all in p_stat=4 (SSTOP), yet init's p_nstopchild count is 
zero!


This seems to cause the loop in find_stopped_child() to exit early (at 
line 790):


	if (parent->p_nstopchild == 0 || child->p_pid == pid) {
		child = NULL;
		break;
	}

(Here, parent points to init's struct proc, child is the struct proc 
obtained from walking the p_children list, and pid is the argument 
passed to the wait4() syscall; init passes the value WAIT_ANY, i.e. -1.)


Questions:

1. Is it correct for init's p_nstopchild to be zero when it has several
   children whose p_stat is SSTOP?

2. Is the above code in find_stopped_child() correct?  Should we really
   be leaving the loop when there are more children to examine?




