Re: NFS related panic? (was: Re: Killing a zombie process?)

2015-11-03 Thread Rhialto
On Fri 23 Oct 2015 at 00:46:57 +0200, Rhialto wrote:
> This problem is very repeatable, usually within a few hours, just now it
> happened within half an hour.
> 
> It seems to me that somehow the nfs_reqq list gets corrupted. Then
> either there is a crash when traversing it in nfs_timer() (occurring in
> nfs_sigintr() due to being called with a bogus pointer), or there is a
> hang when one of the NFS requests gets lost and never retried.

I tried it with a TCP mount for NFS. Still hangs, this time in a bit
under an hour of uptime.

So the cause is likely not packet loss.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: NFS related panic? (was: Re: Killing a zombie process?)

2015-10-22 Thread Rhialto
This problem is very repeatable, usually within a few hours, just now it
happened within half an hour.

It seems to me that somehow the nfs_reqq list gets corrupted. Then
either there is a crash when traversing it in nfs_timer() (occurring in
nfs_sigintr() due to being called with a bogus pointer), or there is a
hang when one of the NFS requests gets lost and never retried.
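
The r_chain field seen elsewhere in this thread has tqe_next/tqe_prev members, so nfs_reqq looks like a TAILQ. In a healthy TAILQ every entry's tqe_prev points at the previous entry's tqe_next (or at the head's tqh_first for the first entry). The following is only a userland sketch of that invariant, assuming the classic <sys/queue.h> field layout; struct demo_req and demo_reqq_check() are made-up names, not the kernel's struct nfsreq or any existing kernel function.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a queued request; NOT the kernel's struct nfsreq. */
struct demo_req {
	TAILQ_ENTRY(demo_req) r_chain;
	int r_xid;
};
TAILQ_HEAD(demo_reqq, demo_req);

/*
 * Walk the queue and verify the back links.
 * Returns the number of entries if every link is consistent, -1 otherwise.
 * Only the tqe_prev/tqh_last values are compared; they are never dereferenced.
 */
static int
demo_reqq_check(struct demo_reqq *q)
{
	struct demo_req *rep;
	struct demo_req **prevp = &q->tqh_first;
	int n = 0;

	for (rep = q->tqh_first; rep != NULL; rep = rep->r_chain.tqe_next) {
		if (rep->r_chain.tqe_prev != prevp)
			return -1;	/* back link does not match predecessor */
		prevp = &rep->r_chain.tqe_next;
		n++;
	}
	/* tqh_last must point at the last entry's tqe_next (or at tqh_first). */
	return (q->tqh_last == prevp) ? n : -1;
}
```

A check like this only catches a damaged back link; a forward pointer that already leads into garbage memory would still fault during the walk, which is presumably what happens to the timer in the crash case.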

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: NFS related panic? (was: Re: Killing a zombie process?)

2015-10-19 Thread Rhialto
On Tue 20 Oct 2015 at 01:04:59 +0200, Rhialto wrote:
> with a rebuilt netbsd.gdb (hopefully the addresses match)
> 
> #5  0x806b94b4 in nfs_sigintr (nmp=0x0, rep=0xfe81163730a8,
> l=0x0) at ../../../../nfs/nfs_socket.c:871

nmp should not be NULL here... let's look at rep, where it comes from
via "nmp = rep->r_nmp;"

(gdb) print *(struct nfsreq *)0xfe81163730a8
$1 = {r_chain = {tqe_next = 0xfe811edcee40, tqe_prev = 0x1}, r_mreq = 
0x828f9888, r_mrep = 0x0, r_md = 0x0, r_dpos = 0x0, r_nmp = 0x0, r_xid 
= 0, r_flags = 0, r_retry = 0, r_rexmit = 0, r_procnum = 0, r_rtt = 0, 
  r_lwp = 0x0}

well, r_chain.tqe_prev looks bogus (unless that's a special marker), so
let's look at tqe_next:

(gdb) print *((struct nfsreq *)0xfe81163730a8)->r_chain.tqe_next
$3 = {r_chain = {tqe_next = 0x0, tqe_prev = 0x15aa3c85d}, r_mreq = 
0xbd83e8af8fe58282, r_mrep = 0x81e39981e3a781e3, r_md = 0xe39d81e38180e38c, 
r_dpos = 0x8890e5b4a0e5ae81, r_nmp = 0xe57baf81e3ab81e3, r_xid = 2179183259, 
  r_flags = -1565268289, r_retry = 0, r_rexmit = 0, r_procnum = 1520683101, 
r_rtt = 1, r_lwp = 0x80e39981e3a781e3}

well, even more bogus. Too bad that the next frame has its argument
optimized out...
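
The panic message for this crash reports "cr2 38", and nmp is NULL in frame #5. That combination is what one would expect if the code read a member located 0x38 bytes into the structure through the NULL pointer: the faulting address is simply base + offset. A small sketch of that arithmetic follows; struct demo_mount and its layout are hypothetical and chosen only so that a member lands at 0x38, they say nothing about which member of the real struct nfsmount is involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; NOT the kernel's struct nfsmount. */
struct demo_mount {
	char pad[0x38];		/* whatever precedes the member */
	int  member_at_0x38;	/* the member being read */
};

/* Address the CPU would touch when evaluating base->member_at_0x38. */
static uintptr_t
demo_fault_address(const struct demo_mount *base)
{
	return (uintptr_t)base + offsetof(struct demo_mount, member_at_0x38);
}
```

With base == NULL this yields 0x38, matching the cr2 value, so the small non-zero fault address is consistent with a NULL nmp rather than with a wild pointer.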

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: NFS related panic? (was: Re: Killing a zombie process?)

2015-10-19 Thread Rhialto
with a rebuilt netbsd.gdb (hopefully the addresses match)

(gdb) target kvm netbsd.5.core
0x8063d735 in cpu_reboot (howto=howto@entry=260,
bootstr=bootstr@entry=0x0) at ../../../../arch/amd64/amd64/machdep.c:671
671 dumpsys();
(gdb) bt
#0  0x8063d735 in cpu_reboot (howto=howto@entry=260,
bootstr=bootstr@entry=0x0) at ../../../../arch/amd64/amd64/machdep.c:671
#1  0x80865182 in vpanic (fmt=0x80d123b2 "trap",
fmt@entry=0x80d123d2 "otection fault",
ap=ap@entry=0xfe80b9fc1d10) at ../../../../kern/subr_prf.c:340
#2  0x8086523d in panic (fmt=fmt@entry=0x80d123d2
"otection fault") at ../../../../kern/subr_prf.c:256
#3  0x808a84d6 in trap (frame=0xfe80b9fc1e30) at
../../../../arch/amd64/amd64/trap.c:298
#4  0x80100f46 in alltraps ()
#5  0x806b94b4 in nfs_sigintr (nmp=0x0, rep=0xfe81163730a8,
l=0x0) at ../../../../nfs/nfs_socket.c:871
#6  0x806b9b0e in nfs_timer (arg=<optimized out>) at
../../../../nfs/nfs_socket.c:752
#7  0x805e9458 in callout_softclock (v=<optimized out>) at
../../../../kern/kern_timeout.c:736
#8  0x805df84a in softint_execute (l=<optimized out>,
s=<optimized out>, si=<optimized out>) at
../../../../kern/kern_softint.c:589
#9  softint_dispatch (pinned=<optimized out>, s=2) at
../../../../kern/kern_softint.c:871
#10 0x8011402f in Xsoftintr ()

(gdb) kvm proc 0xfe813fb39860
nfs_timer (arg=<optimized out>) at ../../../../nfs/nfs_socket.c:735
735 {
(gdb) bt
#0  nfs_timer (arg=<optimized out>) at ../../../../nfs/nfs_socket.c:735
#1  0x in ?? ()

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




NFS related panic? (was: Re: Killing a zombie process?)

2015-10-19 Thread Rhialto
On Fri 16 Oct 2015 at 16:31:18 +0200, J. Hannken-Illjes wrote:
> On 16 Oct 2015, at 13:44, Rhialto  wrote:
> 
> > "Interesting" results: it built packages overnight (from around 22:30 to
> > 12:13, so for nearly 14 hours), then, when I didn't look, it rebooted.
> 
> With panic?

I re-tried and with a pure GENERIC 7.0 kernel it happened again and now
I have a crash dump. Its dmesg ends with this:

nfs server 10.0.0.16:/mnt/scratch: not responding
nfs server 10.0.0.16:/mnt/scratch: is alive again
fatal page fault in supervisor mode
trap type 6 code 0 rip 806b94b4 cs 8 rflags 10246 cr2 38 ilevel 2 rsp ff
fffe80b9fc1f28
curlwp 0xfe813fb39860 pid 0.5 lowest kstack 0xfe80b9fbf2c0
panic: trap
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
callout_softclock() at netbsd:callout_softclock+0x248
softint_dispatch() at netbsd:softint_dispatch+0x79
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfe80b9fc1ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
cpu0: End traceback...

dumping to dev 0,1 (offset=199775, size=1023726):


pid 0.5 is this:

PID    LID S CPU FLAGS   STRUCT LWP *   NAME WAIT
0>   5 7   0   200   fe813fb39860  softclk/0

gdb (without debugging symbols) so far thinks this is in nfs_timer():

(gdb) kvm proc 0xfe813fb39860
0x806b9aab in nfs_timer ()
(gdb) bt
#0  0x806b9aab in nfs_timer ()
#1  0x in ?? ()

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: Killing a zombie process?

2015-10-16 Thread Rhialto
On Thu 15 Oct 2015 at 20:12:44 +0200, Rhialto wrote:
> On Thu 15 Oct 2015 at 06:57:42 +0700, Robert Elz wrote:
> > Do you really need that mounted twice like that, and if not, can you try
> > with one of them missing and see if the problem remains ?
> 
> Good idea, I'll try that later!

"Interesting" results: it built packages overnight (from around 22:30 to
12:13, so for nearly 14 hours), then, when I didn't look, it rebooted.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: Killing a zombie process?

2015-10-16 Thread Rhialto
On Fri 16 Oct 2015 at 16:29:55 +0200, J. Hannken-Illjes wrote:
> Looks like we are waiting for a NFS operation to complete.
> 
> Did the machine hang here?

No, but I didn't try specifically to access the nfs volumes.

Interestingly enough, after the reboot (which used the stock 7.0 GENERIC
kernel) I'm back at a hanging build, at a point which points even more
strongly to NFS:

13:02 cd /usr/pkgsrc/print/teTeX3-texmf && /usr/bin/make update CLEANDEPENDS=yes
===> do-bin-install [teTeX-texmf-3.0nb56] ===> Binary install for 
teTeX-texmf-3.0nb56
=> Installing teTeX-texmf-3.0nb56 from /pkg_comp/packages/All
pkg_add: no pkg found for 'teTeX-texmf-3.0nb56', sorry.
pkg_add: 1 package addition failed
=> No binary package found for teTeX-texmf-3.0nb56; installing from source.
load: 1.00  cmd: sh 22134 [wait] 0.00u 0.00s 0% 1424k
make[1]: Working in: /usr/pkgsrc/print/teTeX3-texmf
make: Working in: /usr/pkgsrc/print/teTeX3-texmf
make[2]: Working in: /usr/pkgsrc/print/teTeX3-texmf
load: 1.00  cmd: sh 22134 [wait] 0.00u 0.00s 0% 1424k
make[1]: Working in: /usr/pkgsrc/print/teTeX3-texmf
make: Working in: /usr/pkgsrc/print/teTeX3-texmf
make[2]: Working in: /usr/pkgsrc/print/teTeX3-texmf
load: 0.03  cmd: sh 22134 [wait] 0.00u 0.00s 0% 1424k
make[2]: Working in: /usr/pkgsrc/print/teTeX3-texmf
make: Working in: /usr/pkgsrc/print/teTeX3-texmf
make[1]: Working in: /usr/pkgsrc/print/teTeX3-texmf

and no process in tstile. But now I am trying to access all the NFS
mounts and they all manage to do at least an ls and du (at least for a
few seconds until interrupted).

Now I'm interrupting the make, which gives me a shell prompt back but
not all is working:

^C
pkg_comp:default70.conf# 
pkg_comp:default70.conf# 
pkg_comp:default70.conf# cd usr/pkgsrc/print/teTeX3-texmf
pkg_comp:default70.conf# make clean
load: 0.78  cmd: sh 28490 [wait] 0.00u 0.00s 0% 1424k
make: Working in: /usr/pkgsrc/print/teTeX3-texmf

The full du I was doing for the mount of /usr/pkgsrc is now also
stalled.

I think we can conclude from this that indeed it is some NFS problem.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: Killing a zombie process?

2015-10-16 Thread J. Hannken-Illjes
On 15 Oct 2015, at 00:21, Rhialto  wrote:

> On Wed 14 Oct 2015 at 09:39:40 +0200, J. Hannken-Illjes wrote:
>> Looks like a deadlock, two threads in tstile.
>> 
>> Please take a backtrace (with arguments) of these threads.
> 
> I've got a whole lot more in tstile, and that is even just from running
> pkg_comp in the chroot. I didn't try to interrupt anything yet.
> 
> load averages:  0.00,  0.20,  0.44;   up 0+02:23:43
> 22:43:52
> 78 processes: 76 sleeping, 2 on CPU
> CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
> Memory: 393M Act, 60K Inact, 31M Wired, 31M Exec, 273M File, 3239M Free
> Swap: 4096M Total, 4096M Free
> 
> 
> vargaz:~$ ps alxtp1
> UID   PID  PPID   CPU PRI NI   VSZ   RSS WCHAN   STAT TTY  TIME COMMAND
> 1000  139174 0  85  0 13208  2528 waitIs   ttyp1 0:00.02 -bash
>   0  1759  1391  1107  85  0 13304  1576 waitIttyp1 0:00.13 /bin/sh 
> /usr/pkg/sbin/pkg_comp chroot
>   0   865  1759  1107  85  0 13304  1140 waitIttyp1 0:00.01 /bin/sh 
> /pkg_comp/tmp/pkg_comp-sOjsoA.sh
>   0   874   865 13547  82  0 11088  1412 pause   Ittyp1 0:00.01 /bin/ksh
>   0   267   874 20048  81  0 15360  1720 waitI+   ttyp1 0:00.22 /bin/sh 
> -e /usr/pkg/sbin/pkg_chk
>   0  9782   267 20048  81  0 15360  1448 waitI+   ttyp1 0:00.00 sh -c cd 
> /usr/pkgsrc/devel/mercurial && /usr/bin/make u
>   0  8085  9782 0 117  0 15224  3452 tstile  D+   ttyp1 0:00.14 
> /usr/bin/make update CLEANDEPENDS
>   0 26889  8085 29745  78  0 15360  1424 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e; /usr/bin/env MAKECONF=/etc/mk.conf P
>   0 14050 26889 0 117  0 15224  3444 tstile  D+   ttyp1 0:00.14 
> /usr/bin/make _MAKE OPSYS OS_VERSION LOWER_OPSYS _PKGSR
>   0  6325 14050 22699  80  0 15360  1428 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e; pkgpattern=mercurial-3.5.1;\t\t\t\t
>   0 13334  6325 0 117  0 15224  3452 tstile  D+   ttyp1 0:00.14 
> /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS HOST_OSTYPE
>   0  2892 13334 29745  78  0 15364  1444 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e;\t\t\t\t\t\t\t\t exec 3<&0;\t\t\t\t\t
>   0 13425  2892 29745  78  0 15364  1136 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e;\t\t\t\t\t\t\t\t exec 3<&0;\t\t\t\t\t
>   0 17339 13425 0 117  0 15224  3504 tstile  D+   ttyp1 0:00.16 
> /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS DEPENDS_TARG
>   0 11893 17339 23601  80  0 15364  1432 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e; pkgpattern=py27-mercurial\\>=3.5.1;\
>   0 21797 11893 0 117  0 15228  3512 tstile  D+   ttyp1 0:00.18 
> /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS DEPENDS_TARG
>   0  1347 21797 23778  80  0 15364  1456 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e;\t\t\t\t\t if test -n "" &&  /usr/pkg
>   0 23567  1347 0 117  0 15228  4032 tstile  D+   ttyp1 0:00.38 
> /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS DEPENDS_TARG
>   0  3383 23567 29360  78  0 15364  1432 waitI+   ttyp1 0:00.00 /bin/sh 
> -c (cd /pkg_comp/obj/pkgsrc/devel/py-mercurial/
>   0 21311  3383 28277  79  0 81652 11580 waitI+   ttyp1 0:00.14 
> /usr/pkg/bin/python2.7 setup.py build
>   0 24114 21311 28277  79  0 15364  1424 waitI+   ttyp1 0:00.01 /bin/sh 
> /pkg_comp/obj/pkgsrc/devel/py-mercurial/default
>   0  3590 24114 28277  79  0 15364  1472 waitI+   ttyp1 0:00.00 /bin/sh 
> /usr/pkgsrc/mk/tools/msgfmt.sh
>   0  7060  3590 28277 117  0  4244   188 tstile  D+   ttyp1 0:00.00 /bin/cat
>   0 18497  3590 28277  79  0 10880  1064 pipe_wr I+   ttyp1 0:00.00 /bin/cat 
> i18n/el.po
>   0 23883  3590 0 117  0  6580   236 netio   D+   ttyp1 0:00.00 
> /usr/bin/msgfmt -v -o mercurial/locale/el/LC_MESSAGES/h
>   0 27257  3590 28277 117  0  4244   188 tstile  D+   ttyp1 0:00.00 /bin/cat
>   0 29472  3590 28277  79  0 14244  2344 pipe_wr I+   ttyp1 0:00.01 
> /usr/bin/awk -f /usr/bin/awk
> 
> (I've re-arranged the order to get parents before children)
> 
> Here are backtraces of the processes in tstile (and the shell that
> spawned the 4 leaf children). I have kept the dump so I can examine it
> further.
> 
> Unfortunately, crash(8) didn't give me arguments, nor did ddb when I
> tried that (I used the GENERIC kernel, what options do I need to get the
> arguments?)
> 
> Script started on Wed Oct 14 23:41:43 2015
> vargaz:~/crash$ crash -M netbsd.3.core -N netbsd.test
> Crash version 7.0, image version 7.99.21.
> WARNING: versions differ, you may not be able to examine this image.
> System panicked: dump forced via kernel debugger
> Backtrace from time of crash is available.
> 
> 
> crash> bt/t 0t3590
> trace: pid 3590 lid 1 at 0xfe8040758d00
> sleepq_block() at sleepq_block+0xa2
> cv_wait_sig() at cv_wait_sig+0xfe
> do_sys_wait() at do_sys_wait+0x22c
> sys___wait450() at sys___wait450+0x3a
> syscall() at syscall+0x9c
> --- syscall (number 449) ---
> 7f7ff683c1ea:
> 
> 
> crash> bt/t 0t7060
> trace: pid 7060 lid 1 at 0xfe804076c770
> sleepq_block() at 

Re: Killing a zombie process?

2015-10-16 Thread Rhialto
On Fri 16 Oct 2015 at 16:31:18 +0200, J. Hannken-Illjes wrote:
> On 16 Oct 2015, at 13:44, Rhialto  wrote:
> 
> > On Thu 15 Oct 2015 at 20:12:44 +0200, Rhialto wrote:
> >> On Thu 15 Oct 2015 at 06:57:42 +0700, Robert Elz wrote:
> >>> Do you really need that mounted twice like that, and if not, can you try
> >>> with one of them missing and see if the problem remains ?
> >> 
> >> Good idea, I'll try that later!
> > 
> > "Interesting" results: it built packages overnight (from around 22:30 to
> > 12:13, so for nearly 14 hours), then, when I didn't look, it rebooted.
> 
> With panic?

This was logged; for some reason savecore claims there is no dump
though.

Oct 15 19:47:32 vargaz /netbsd: NetBSD 7.99.21 (GENERIC) #1: Wed Oct 14 
01:52:52 CEST 2015

Oct 16 12:15:16 vargaz syslogd[798]: restart
Oct 16 12:15:16 vargaz /netbsd: fatal page fault in supervisor mode
Oct 16 12:15:16 vargaz /netbsd: trap type 6 code 0 rip 80714eed cs 8 
rflags 10246 cr2 38 ilevel 2 rsp fe80b9fc6f10
Oct 16 12:15:16 vargaz /netbsd: curlwp 0xfe813fb34860 pid 0.5 lowest kstack 
0xfe80b9fc32c0
Oct 16 12:15:16 vargaz /netbsd: panic: trap
Oct 16 12:15:16 vargaz /netbsd: cpu0: Begin traceback...
Oct 16 12:15:16 vargaz /netbsd: vpanic() at netbsd:vpanic+0x13c
Oct 16 12:15:16 vargaz /netbsd: snprintf() at netbsd:snprintf
Oct 16 12:15:16 vargaz /netbsd: startlwp() at netbsd:startlwp
Oct 16 12:15:16 vargaz /netbsd: alltraps() at netbsd:alltraps+0x9e
Oct 16 12:15:16 vargaz /netbsd: callout_softclock() at 
netbsd:callout_softclock+0x392
Oct 16 12:15:16 vargaz /netbsd: softint_dispatch() at 
netbsd:softint_dispatch+0xd3
Oct 16 12:15:16 vargaz /netbsd: DDB lost frame for netbsd:Xsoftintr+0x4f, 
trying 0xfe80b9fc6ff0
Oct 16 12:15:16 vargaz /netbsd: Xsoftintr() at netbsd:Xsoftintr+0x4f
Oct 16 12:15:16 vargaz /netbsd: --- interrupt ---
Oct 16 12:15:16 vargaz /netbsd: 0:
Oct 16 12:15:16 vargaz /netbsd: cpu0: End traceback...
Oct 16 12:15:16 vargaz /netbsd: 
Oct 16 12:15:16 vargaz /netbsd: dumping to dev 0,1 (offset=199775, 
size=1023726):
Oct 16 12:15:16 vargaz /netbsd: dump succeeded
Oct 16 12:15:16 vargaz /netbsd: 
Oct 16 12:15:16 vargaz /netbsd: 
Oct 16 12:15:16 vargaz /netbsd: rebooting...
Oct 16 12:15:16 vargaz /netbsd: Copyright (c) 1996, 1997, 1998, 1999, 2000, 
2001, 2002, 2003, 2004, 200


-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: Killing a zombie process?

2015-10-16 Thread J. Hannken-Illjes
On 16 Oct 2015, at 13:44, Rhialto  wrote:

> On Thu 15 Oct 2015 at 20:12:44 +0200, Rhialto wrote:
>> On Thu 15 Oct 2015 at 06:57:42 +0700, Robert Elz wrote:
>>> Do you really need that mounted twice like that, and if not, can you try
>>> with one of them missing and see if the problem remains ?
>> 
>> Good idea, I'll try that later!
> 
> "Interesting" results: it built packages overnight (from around 22:30 to
> 12:13, so for nearly 14 hours), then, when I didn't look, it rebooted.

With panic?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)





Re: Killing a zombie process?

2015-10-15 Thread Rhialto
On Thu 15 Oct 2015 at 06:57:42 +0700, Robert Elz wrote:
> I do wonder about ...
> 
>   | procfs on /usr/pkg/emul/linux32/proc type procfs (read-only, local)
>   | procfs on /usr/pkg/emul/linux32/proc type procfs (local)

Ah good catch. That seems to be a botched attempt to mount the linux
procfs /inside/ the chroot. But I made a typo somewhere.
That used to be needed in the past for building some package I think,
but since I don't have it on my main machine that is probably not true
any more.

> Especially since some of the tstile processes you showed are doing
> lookups under namei_tryemulroot()

However the getcwds should not come near that proc directory, since it
is outside the chroot.

> Do you really need that mounted twice like that, and if not, can you try
> with one of them missing and see if the problem remains ?

Good idea, I'll try that later!

> kre
-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: Killing a zombie process?

2015-10-14 Thread J. Hannken-Illjes

On 14 Oct 2015, at 00:20, Rhialto  wrote:

> I may have something similar; with 7.0/amd64 GENERIC kernel.
> 
> I've been doing builds in pkg_comp with the chroot directory and /usr/pkgsrc
> mounted over nfs. After some packages, some processes simply don't terminate.
> 
> Some of my processes are now (after trying to exit pkg_comp which hangs)
> 
>  UID   PID  PPID  CPU PRI NI    VSZ   RSS WCHAN   STAT TTY       TIME COMMAND
>    0   402     1    0  85  0  15360  1428 wait    I    pts/2  0:00.00 /bin/sh
> -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l
> -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
> 1000   683  2907    0  85  0  13224  2588 wait    Is   pts/2  0:00.03 -bash
>    0  2847   683  257 117  0  13304  1576 tstile  D+   pts/2  0:00.02 /bin/sh
> /usr/pkg/sbin/pkg_comp chroot
>    0 14284     1    0  85  0  15360  1428 wait    I    pts/2  0:00.00 /bin/sh
> -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l
> -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
>    0 26291   402  708 117  0  15360  1004 tstile  D    pts/2  0:00.00 /bin/sh
> -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l
> -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
>    0 28266 14284    0 116  0  15360  1004 netio   D    pts/2  0:00.01 /bin/sh
> -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l
> -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
> 
> No zombies involved, though.

Looks like a deadlock, two threads in tstile.

Please take a backtrace (with arguments) of these threads.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)





Re: Killing a zombie process?

2015-10-14 Thread Rhialto
On Thu 15 Oct 2015 at 00:21:55 +0200, Rhialto wrote:
> I've got a whole lot more in tstile, and that is even just from running
> pkg_comp in the chroot. I didn't try to interrupt anything yet.

I forgot to mention that this is with a kernel cvs'ed about 24 hours
ago. So this issue isn't the same as the zombie-counting issue.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: Killing a zombie process?

2015-10-14 Thread Robert Elz
Date:Thu, 15 Oct 2015 00:21:55 +0200
From:Rhialto 
Message-ID:  <20151014222155.ga25...@falu.nl>

First, I agree this has nothing at all to do with the zombie refcount
issue (nothing to do with zombies, or process lists or anything slightly
related).

I do wonder about ...

  | procfs on /usr/pkg/emul/linux32/proc type procfs (read-only, local)
  | procfs on /usr/pkg/emul/linux32/proc type procfs (local)

Especially since some of the tstile processes you showed are doing
lookups under namei_tryemulroot()

Do you really need that mounted twice like that, and if not, can you try
with one of them missing and see if the problem remains ?

kre



Re: Killing a zombie process?

2015-10-13 Thread Rhialto
I may have something similar; with 7.0/amd64 GENERIC kernel.

I've been doing builds in pkg_comp with the chroot directory and /usr/pkgsrc
mounted over nfs. After some packages, some processes simply don't terminate.

Some of my processes are now (after trying to exit pkg_comp which hangs)

 UID   PID  PPID  CPU PRI NI    VSZ   RSS WCHAN   STAT TTY       TIME COMMAND
   0   402     1    0  85  0  15360  1428 wait    I    pts/2  0:00.00 /bin/sh
-c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l
-print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
1000   683  2907    0  85  0  13224  2588 wait    Is   pts/2  0:00.03 -bash
   0  2847   683  257 117  0  13304  1576 tstile  D+   pts/2  0:00.02 /bin/sh
/usr/pkg/sbin/pkg_comp chroot
   0 14284     1    0  85  0  15360  1428 wait    I    pts/2  0:00.00 /bin/sh
-c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l
-print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
   0 26291   402  708 117  0  15360  1004 tstile  D    pts/2  0:00.00 /bin/sh
-c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l
-print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
   0 28266 14284    0 116  0  15360  1004 netio   D    pts/2  0:00.01 /bin/sh
-c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l
-print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f

No zombies involved, though.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: Killing a zombie process?

2015-10-04 Thread Paul Goyette

On Sun, 4 Oct 2015, Robert Elz wrote:


>    Date: Sun, 4 Oct 2015 17:25:21 +0800 (PHT)
>    From: Paul Goyette
>    Message-ID:
>
>  | I'm pretty much convinced that the p_nstopchild accounting is screwed up
>  | somewhere.
>
> I think I agree.
>
>  | I'm planning on adding the following code in "optimization"
>  | in kern_exit so I can catch it as soon as it happens.
>
> Sooner, but unfortunately, most probably not soon enough.
>
> It is most likely some locking/race condition with multiple processes
> dying at the same time (approximately) that is causing some of the
> increments to be lost.   Making them all use atomic ops, instead of just ++
> might fix the problem, at the cost of never discovering where the issue
> actually occurs - there should be locks around all manipulations of
> this stuff, possibly one of them is missing or misplaced.


Yeah, I think that there's a basic accounting problem somewhere, and 
with an extreme load it is more likely for the SSTOPed process to get 
inserted in the p_children/p_sibling list before the SZOMB process can 
get reaped.  Once the SSTOPed process gets to front-of line (with the 
parent's p_nstopchild count zero), the SZOMB process won't ever get 
processed.  My patch will simply validate this theory.


(BTW, the patch is actually wrong, as it would also panic in the case 
where the wait was for a specific pid.  I've modified it in my new 
kernel - not yet tested.)



> It is unlikely to be in the wait processing (at least not this one) as
> there's just one process doing the waiting, there would be no contention
> for the accesses here (it could be a combination of the two though,
> wait() happening at the same instant a process is dying).


See above.


> I'm also puzzled by your observations of forked init processes having
> exited - after rc is finished, init generally only forks when one of the
> console/terminal sessions ends, and a new getty needs to be started.
> On most modern systems, that's a very rare event - though if you use
> the console (ctl-alt-Fn or whatever it is) switching, and login and out
> of those (virtual) terminals, it would happen.  Is there anything like
> that in your environment?


I do occasionally switch to another wsdisplay screen (away from the X 
one), but not frequently.  I definitely do a switch before I use 
Ctrl/Alt/Esc to get into ddb.


I'm wondering if some (most? all?) of the SSTOPd processes I see are a 
result of entering ddb and/or triggering the reboot?  Doesn't ddb need 
to stop whatever is running on "the other CPU cores" ?




+--+--+-+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:   |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com|
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+--+--+-+


Re: Killing a zombie process?

2015-10-04 Thread Paul Goyette
I'm pretty much convinced that the p_nstopchild accounting is screwed up 
somewhere.  I'm planning on adding the following code in "optimization" 
in kern_exit so I can catch it as soon as it happens.


Basically, if the optimization would cause us to stop looking for a 
process to report, this hack/patch will just scan the rest of the 
sibling list.  If it finds a zombie that should be reported, it will 
panic, and I'll have pointers to both the zombie and the process at 
which the optimization occurred.


Comments?


Index: kern_exit.c
===
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.245
diff -u -p -r1.245 kern_exit.c
--- kern_exit.c 2 Oct 2015 16:54:15 -   1.245
+++ kern_exit.c 4 Oct 2015 09:15:00 -
@@ -788,6 +788,14 @@ find_stopped_child(struct proc *parent,
break;
}
if (parent->p_nstopchild == 0 || child->p_pid == pid) {
+/* XXX */
+                       struct proc *nxtchild = child;
+                       while ((nxtchild = LIST_NEXT(nxtchild, p_sibling)) != NULL)
+                               if (nxtchild->p_stat == SZOMB)
+                                       panic("Zombie %p not reaped - "
+                                           "scan stopped at proc %p",
+                                           nxtchild, child);
+/* XXX */
child = NULL;
break;
}
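
To illustrate why a wrong counter would hide a zombie, here is a much simplified userland model of the early exit that the patch instruments. It is a sketch under assumptions, not the code in kern_exit.c: struct model_child, the M_* states and model_find_zombie() are invented names, the real loop also handles pid matching, tracing and wait options, and pending_count merely stands in for the parent's p_nstopchild.

```c
#include <stddef.h>

/* Simplified process states for the model only. */
enum model_stat { M_ACTIVE, M_STOP, M_ZOMB };

/* Hypothetical child entry on a singly linked sibling list. */
struct model_child {
	int pid;
	enum model_stat stat;
	const struct model_child *next;
};

/*
 * Scan the sibling list for a zombie, but give up as soon as the
 * parent's counter claims there is nothing left to report.
 */
static const struct model_child *
model_find_zombie(const struct model_child *head, int pending_count)
{
	for (const struct model_child *c = head; c != NULL; c = c->next) {
		if (c->stat == M_ZOMB)
			return c;
		if (pending_count == 0)
			return NULL;	/* shortcut: trusts the counter */
	}
	return NULL;
}
```

With a stopped child at the head of the list and a zombie behind it, a counter of 0 makes the scan return NULL every time, so the zombie is never reaped; with the counter at 1 the same list yields the zombie. A zombie at the head is found either way, which fits the observation that the problem needs an unlucky list order.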



+--+--+-+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:   |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com|
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+--+--+-+


Re: Killing a zombie process?

2015-10-04 Thread Robert Elz
Date:Sun, 4 Oct 2015 17:25:21 +0800 (PHT)
From:Paul Goyette 
Message-ID:  

  | I'm pretty much convinced that the p_nstopchild accounting is screwed up 
  | somewhere.

I think I agree.

  | I'm planning on adding the following code in "optimization" 
  | in kern_exit so I can catch it as soon as it happens.

Sooner, but unfortunately, most probably not soon enough.

It is most likely some locking/race condition with multiple processes
dying at the same time (approximately) that is causing some of the
increments to be lost.   Making them all use atomic ops, instead of just ++
might fix the problem, at the cost of never discovering where the issue
actually occurs - there should be locks around all manipulations of
this stuff, possibly one of them is missing or misplaced.
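
For readers unfamiliar with how an increment gets "lost": a plain ++ is a load, an add and a store, and two CPUs can interleave those steps. The sketch below replays one such interleaving deterministically in a single thread and contrasts it with an atomic read-modify-write. It uses C11 <stdatomic.h> in userland purely for illustration; the function names are made up, and the kernel would use its own atomic primitives or, as suggested above, proper locking.

```c
#include <stdatomic.h>

/*
 * Replay of two CPUs both executing "counter++" without a lock,
 * with the loads happening before either store.
 */
static int
replay_plain_increment(int start)
{
	int counter = start;
	int cpu0_reg = counter;	/* CPU0 loads the old value */
	int cpu1_reg = counter;	/* CPU1 loads the same old value */

	cpu0_reg += 1;
	cpu1_reg += 1;
	counter = cpu0_reg;	/* CPU0 stores start + 1 */
	counter = cpu1_reg;	/* CPU1 overwrites it with start + 1 */
	return counter;		/* one increment has been lost */
}

/* The same two increments done as indivisible read-modify-write steps. */
static int
replay_atomic_increment(int start)
{
	atomic_int counter;

	atomic_init(&counter, start);
	atomic_fetch_add(&counter, 1);
	atomic_fetch_add(&counter, 1);
	return atomic_load(&counter);
}
```

Starting from 0, the plain replay ends at 1 while the atomic version ends at 2. As noted above, switching to atomics would paper over the symptom without showing which lock is missing or misplaced.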

It is unlikely to be in the wait processing (at least not this one) as
there's just one process doing the waiting, there would be no contention
for the accesses here (it could be a combination of the two though,
wait() happening at the same instant a process is dying).

I'm also puzzled by your observations of forked init processes having
exited - after rc is finished, init generally only forks when one of the
console/terminal sessions ends, and a new getty needs to be started.
On most modern systems, that's a very rare event - though if you use
the console (ctl-alt-Fn or whatever it is) switching, and login and out
of those (virtual) terminals, it would happen.  Is there anything like
that in your environment?

kre



Re: Killing a zombie process?

2015-10-04 Thread Paul Goyette

On Sun, 4 Oct 2015, Paul Goyette wrote:


 | 1. Is it correct for init's p_nstopchild to be zero when it has several
 | children whose p_state is SSTOP?

Depends whether those children have previously been waited for or not.
Stopped children don't go away when they're waited for, so there needs
to be something to prevent wait() returning the same stopped child
over and over again.   That's p_waited ... so you need to check that
value of the stopped children, if it is 0, then something is broken.
If it is 1 (for all of them) then they're irrelevant, and matter not
at all.



Here's another instance of the problem.  (Note that I'm limping along 
with crash(8) here since gdb isn't cooperating at the moment.)


crash> show proc 1
init: pid 1 proc fe810f46ecd0 vmspace/map fe810f483e60 flags 4001
  lwp 1 fe810f476a60 pcb fe810f464000
stat 2 flags 802 cpu 0 pri 43
crash> x/x 0xfe810f46ecd0+0x130
fe810f46ee00:   0   p_nstopchild == 0
crash> x/x 0xfe810f46ecd0+0x100,2
fe810f46edd0:   7b5f5800fe80p_children listhead

Looking at the first child...

crash> x/x 0xfe807b5f5800+0xd0
fe807b5f58d0:   4   p_stat == SSTOP
crash>
fe807b5f58d4:   6f68p_pid
crash> show proc 0x6f68
init: pid 28520 proc fe807b5f5800 vmspace/map fe807e7be480 flags 0
  lwp 1 fe811e636300 pcb fe81aae19000
stat 2 flags 802 cpu 3 pri 43
crash> x/x 0xfe807b5f5800+0x134
fe807b5f5934:   0   p_waited == 0
crash> x/x 0xfe807b5f5800+0xf0,2
fe807b5f58f0:   f46e520 fe81p_sibling.le_next

So, the first child of init appears to be another instance of init, and 
its state is SSTOP.  It has not been waited for, yet its parent (the 
"real" init, pid=1) has a zero count for p_nstopchild.



This problem is easily reproduced, but only under heavy-load conditions. 
On an amd64 (CPU = Intel i5-4460 @ 3.20GHz) 7.99.21 I've been running a 
'build.sh -j3 release' in parallel with a series of pkgsrc builds 
running with MAKE_JOBS=3;  it takes from 30 to 60 minutes of this before 
the Zombie appears. (The pkgsrc builds are running in chroot created by 
pkgsrc/sysutils/mksandbox.)



+--+--+-+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:   |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com|
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+--+--+-+


Re: Killing a zombie process?

2015-10-04 Thread Robert Elz
Date:Sun, 4 Oct 2015 20:52:43 +0800 (PHT)
From:Paul Goyette 
Message-ID:  

  | I do occasionally switch to another wsdisplay screen (away from the X 
  | one), but not frequently.  I definitely do a switch before I use 
  | Ctrl/Alt/Esc to get into ddb.

OK, that could explain the forked init.

  | I'm wondering if some (most? all?) of the SSTOPd processes I see are a 
  | result of entering ddb and/or triggering the reboot?  Doesn't ddb need 
  | to stop whatever is running on "the other CPU cores" ?

No, not that kind of stop.

kre

ps: you might want to try fixing PR 50298 (that I just submitted) and see
if that makes a difference - I think the chances are about one in infinity,
but ...



Re: Killing a zombie process?

2015-10-03 Thread Robert Elz
Date:Fri, 2 Oct 2015 15:26:42 +0800 (PHT)
From:Paul Goyette 
Message-ID:  

  | 1. Is it correct for init's p_nstopchild to be zero when it has several
  | children whose p_stat is SSTOP?

Depends whether those children have previously been waited for or not.
Stopped children don't go away when they're waited for, so there needs
to be something to prevent wait() returning the same stopped child
over and over again.   That's p_waited ... so you need to check that
value of the stopped children, if it is 0, then something is broken.
If it is 1 (for all of them) then they're irrelevant, and matter not
at all.
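
The rule of thumb above can be written down as a small consistency check.
What follows is only a user-space sketch, not kernel code: struct chk_proc,
count_unwaited_stopped() and stop_count_consistent() are hypothetical names,
the fields only loosely mirror p_stat, p_waited and p_nstopchild, and the
state values are assumed from the numbers quoted elsewhere in this thread
(SSTOP = 4, SZOMB = 5).

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's process states. */
enum chk_stat { CHK_SACTIVE = 2, CHK_SSTOP = 4, CHK_SZOMB = 5 };

/* Simplified model of the struct proc fields discussed here. */
struct chk_proc {
	enum chk_stat p_stat;	/* process state */
	int p_waited;		/* 1 once wait() has reported the stop */
	int p_nstopchild;	/* counter, only meaningful for the parent */
};

/* Count children that are stopped and not yet reported by wait(). */
static size_t
count_unwaited_stopped(const struct chk_proc *children, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		if (children[i].p_stat == CHK_SSTOP &&
		    children[i].p_waited == 0)
			count++;
	}
	return count;
}

/*
 * Returns 0 if "something is broken" in the sense described above
 * (an unwaited stopped child exists while the parent's counter is
 * zero), and 1 otherwise.
 */
static int
stop_count_consistent(const struct chk_proc *parent,
    const struct chk_proc *children, size_t n)
{
	if (count_unwaited_stopped(children, n) > 0 &&
	    parent->p_nstopchild == 0)
		return 0;
	return 1;
}
```

Stopped children with p_waited == 1 are ignored by the check, matching the
"irrelevant" case described above.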

  | 2. Is the above code in init correct?  Should we really be leaving the
  | loop when there are more children to examine?

It is an optimisation, and should be correct.

However, it does depend upon p_nstopchild being maintained correctly.

You didn't say whether your zombie process is actually to be found
(somewhere) on the parent's (ie: init's) list of children.

I have no idea how one would discover this (at this point, or given
how long you need to wait for it to happen, perhaps ever) but it would
also be interesting to know whether the zombie was reparented to init
before or after it died.

The common case is for a parent to exit, leaving running children, which
are reparented to init, complete, exit, and init cleans them up.

But it is also possible for a child to die, be ignored by its parent,
which later exits itself, leaving the zombie to be reparented to init.
That's more unusual - does not happen very often, but if that is
what happened here, it is possible that there's some bug in the processing
of that case.

kre



Re: Killing a zombie process?

2015-10-03 Thread Paul Goyette

On Sun, 4 Oct 2015, Robert Elz wrote:


>    Date:Fri, 2 Oct 2015 15:26:42 +0800 (PHT)
>    From:Paul Goyette 
>    Message-ID:  
>
>  | 1. Is it correct for init's p_nstopchild to be zero when it has several
>  | children whose p_stat is SSTOP?
>
> Depends whether those children have previously been waited for or not.
> Stopped children don't go away when they're waited for, so there needs
> to be something to prevent wait() returning the same stopped child
> over and over again.   That's p_waited ... so you need to check that
> value of the stopped children, if it is 0, then something is broken.
> If it is 1 (for all of them) then they're irrelevant, and matter not
> at all.


All of those head-of-sibling-list processes were p_stat == SSTOP and 
p_waited=0, and none of them has (p_slflag & PSL_TRACED).  And, since 
init(8) is calling waitpid( ..., ..., 0), the value of options is zero 
so the following code (immediately before the previously-quoted code, at 
src/sys/kern/kern_exit.c:780) doesn't trigger:


	if (child->p_stat == SSTOP &&
	    child->p_waited == 0 &&
	    (child->p_slflag & PSL_TRACED ||
	    options & WUNTRACED)) {
		if ((options & WNOWAIT) == 0) {
			child->p_waited = 1;
			parent->p_nstopchild--;
		}
		break;
	}

So "something is broken" ?  :)


The waitpid() call in init is at src/sbin/init/init.c:1506.  Since my 
Zombie does finally die during a transition from multi-user back down to 
single-user, I'm guessing that one of the other calls to waitpid() is 
clearing out the SSTOPed processes at the head of the p_sibling list, 
perhaps the call in single_user() at line 773?


	...
	requested_transition = 0;
	do {
		if ((wpid = waitpid(-1, &status, WUNTRACED)) != -1)
			collect_child(wpid, status);
	...
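
The difference between the two kinds of waitpid() call can be seen from
user space.  This is a minimal sketch using only POSIX calls, not code from
init.c; demo_wuntraced() is a hypothetical name.  It forks a child that
stops itself, then checks that the stop is reported with WUNTRACED and not
reported when the options are otherwise empty (WNOHANG is used there only so
the call does not block).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Returns 0 if every step behaved as described, otherwise the number
 * of the step that did not.
 */
int
demo_wuntraced(void)
{
	int status;
	pid_t pid = fork();

	if (pid == -1)
		return 1;
	if (pid == 0) {
		raise(SIGSTOP);		/* child: enter the stopped state */
		_exit(0);		/* only reached if continued */
	}

	/* With WUNTRACED the stopped child is reported. */
	if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
		return 2;

	/* Without WUNTRACED there is nothing to report for it. */
	if (waitpid(pid, &status, WNOHANG) != 0)
		return 3;

	/* Clean up: kill the stopped child and reap it. */
	if (kill(pid, SIGKILL) == -1)
		return 4;
	if (waitpid(pid, &status, 0) != pid || !WIFSIGNALED(status))
		return 5;
	return 0;
}
```

A caller that only ever passes options == 0 never sees stopped children
this way, which is consistent with the guess that a WUNTRACED call is the
one that clears them.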




>  | 2. Is the above code in init correct?  Should we really be leaving the
>  | loop when there are more children to examine?
>
> It is an optimisation, and should be correct.
>
> However, it does depend upon p_nstopchild being maintained correctly.
>
> You didn't say whether your zombie process is actually to be found
> (somewhere) on the parent's (ie: init's) list of children.


Yes, the zombie was the eighth entry on init's p_sibling list.

Several of the front-of-list processes appeared to be related to some 
system daemons.  (One was related to consolekit, one to dbus.)  And the 
very first child of init seems to have been another copy of init (based 
on its p_comm[] field)!




> I have no idea how one would discover this (at this point, or given
> how long you need to wait for it to happen, perhaps ever) but it would
> also be interesting to know whether the zombie was reparented to init
> before or after it died.
>
> The common case is for a parent to exit, leaving running children, which
> are reparented to init, complete, exit, and init cleans them up.
>
> But it is also possible for a child to die, be ignored by its parent,
> which later exits itself, leaving the zombie to be reparented to init.
> That's more unusual - does not happen very often, but if that is what
> happened here, it is possible that there's some bug in the processing
> of that case.


Hmmm, probably not possible to differentiate at this point.





Re: Killing a zombie process?

2015-10-02 Thread Paul Goyette

OK, I forced a system crash (using ddb's sync command), and here's what 
gdb says about the zombie's struct proc (manually inserted line breaks 
for improved readability, and some flag value annotations)


(gdb) print (struct proc *) 0xfe81f578ba70
$1 = (struct proc *) 0xfe81f578ba70
(gdb) print *(struct proc *) 0xfe81f578ba70
$2 = {
  p_list = {le_next = 0x0, le_prev = 0x806be700 },
  p_auxlock = {u = {mtxa_owner = 0}},
  p_lock = 0xfe81fbb7a840,
  p_stmutex = {u = {mtxa_owner = 2049}},
  p_reflock = {rw_owner = 0},
  p_waitcv = {cv_opaque = {0x0, 0xfe81f578baa0, 0x804d542e}},
  p_lwpcv = {cv_opaque = {0x0, 0xfe81f578bab8, 0x804e7f9a}},
  p_cred = 0xfe81ef0106c0,
  p_fd = 0xfe810f46f680,
  p_cwdi = 0x0,
  p_stats = 0xfe81e00b5700,
  p_limit = 0xfe8155fe8de8,
  p_vmspace = 0x80722de0 ,
  p_sigacts = 0xfe803be9b258,
  p_aio = 0x0,
  p_mqueue_cnt = 0,
  p_specdataref = {
specdataref_container = 0x0,
specdataref_lock = {u = {mtxa_owner = 18446744073709551600}}},
  p_exitsig = 20,
  p_flag = 0,
  p_sflag = 8192 ,
  p_slflag = 0,
  p_lflag = 2 ,
  p_stflag = 0,
  p_stat = 5 '\005' ,
  p_trace_enabled = 0 '\000',
  p_pad1 = "\203",
  p_pid = 21120,
  p_pglist = {
le_next = 0x0,
le_prev = 0xfe81eab655b0},
  p_pptr = 0xfe810f45ecd0,
  p_sibling = {
le_next = 0xfe81f7618d20, le_prev = 0xfe81fc805108},
  p_children = {lh_first = 0x0},
  p_lwps = {lh_first = 0xfe8021ccb560},
  p_raslist = 0x0,
  p_nlwps = 1,
  p_nzlwps = 1,
  p_nrlwps = 0,
  p_nlwpwait = 0,
  p_ndlwps = 0,
  p_nlwpid = 1,
  p_nstopchild = 0,
  p_waited = 0,
  p_zomblwp = 0x0,
  p_vforklwp = 0x0,
  p_sched_info = 0x0,
  p_estcpu = 0,
  p_estcpu_inherited = 36864,
  p_forktime = 17842,
  p_pctcpu = 0,
  p_opptr = 0x0,
  p_timers = 0x0,
  p_rtime = {sec = 0, frac = 0},
  p_uticks = 0,
  p_sticks = 0,
  p_iticks = 0,
  p_traceflag = 0,
  p_tracep = 0x0,
  p_textvp = 0xfe81e6023190,
  p_emul = 0x806b6300 ,
  p_emuldata = 0x0,
  p_execsw = 0x808be0e0,
  p_klist = { slh_first = 0x0},
  p_sigwaiters = {lh_first = 0x0},
  p_sigpend = {
sp_info = {tqh_first = 0x0, tqh_last = 0xfe81f578bc48},
sp_set = {__bits = {0, 0, 0, 0}}},
  p_lwpctl = 0x0,
  p_ppid = 1,
  p_fpid = 0,
  p_sigctx = {
ps_signo = 0, ps_code = 0, ps_lwp = 0, ps_sigcode = 0x0,
ps_sigignore = {__bits = {4294967295, 4294967295, 4294967295, 4294967295}},
ps_sigcatch = {__bits = {0, 0, 0, 0}}},
  p_nice = 20 '\024',
  p_comm = "sh\000ke", '\000' ,
  p_pgrp = 0xfe81eab655b0,
  p_psstrp = 140187732541408,
  p_pax = 0,
  p_xstat = 0,
  p_acflag = 1,
  p_md = {md_flags = 0, md_syscall = 0x8012f010 },
  p_stackbase = 140187732541440,
  p_dtrace = 0x7f7ff683b8e6}

As far as I can tell, everything looks normal.  Yet the process never 
gets reaped by init.


The one thing that surprises me here is that the zombie still has a 
pointer to p_textvp which would point to /bin/sh _within_ the chroot() 
sandbox (consistent with the p_comm = "sh" entry).  I'm guessing that 
this reference is what's preventing me from unmounting this nullfs 
mount.  (I previously expected the inability to unmount to be the result 
of a reference from the zombie's cwd.)





Re: Killing a zombie process?

2015-10-02 Thread Paul Goyette

Still investigating, but I think I may have found something...

Using the p_pptr value 0xfe810f45ecd0 from the zombie's struct proc, 
I examined the struct proc for init.  I followed the code from the 
find_stopped_child() routine in src/sys/kern/kern_exit.c, and walked 
through the loop for each of init's children.  The first several 
processes are all in p_stat=4 (SSTOP), yet init's p_nstopchild count is 
zero!


This seems to cause the loop in find_stopped_child() to exit early (at 
line 790):


	if (parent->p_nstopchild == 0 || child->p_pid == pid) {
		child = NULL;
		break;
	}

(Here, parent points to init's struct proc, child is the struct proc 
obtained from walking the p_children list, and pid is the argument 
passed to the wait4() syscall - init passes value WAIT_ANY, ie -1.)
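
To make the suspected early exit concrete, here is a simplified user-space
model of the loop.  It is a sketch under stated assumptions, not the code
from kern_exit.c: struct sim_proc, struct sim_parent, sim_find_child() and
the SIM_* constants are hypothetical, the children are held in an array
instead of the p_sibling list, WNOWAIT and the pid match are left out
(pid is -1 here), and the position of the zombie test ahead of the other
two is assumed.

```c
#include <stddef.h>

enum { SIM_SACTIVE = 2, SIM_SSTOP = 4, SIM_SZOMB = 5 };
#define SIM_WUNTRACED 0x2	/* option bit for this model only */

struct sim_proc {
	int p_pid;
	int p_stat;
	int p_waited;
	int p_traced;		/* stands in for p_slflag & PSL_TRACED */
};

struct sim_parent {
	int p_nstopchild;
	struct sim_proc *children;	/* in sibling-list order */
	size_t nchildren;
};

/* Returns the child to report, or NULL if the walk gives up. */
struct sim_proc *
sim_find_child(struct sim_parent *parent, int options)
{
	size_t i;

	for (i = 0; i < parent->nchildren; i++) {
		struct sim_proc *child = &parent->children[i];

		/* Assumed: a zombie is reported as soon as it is visited. */
		if (child->p_stat == SIM_SZOMB)
			return child;

		/* Stopped-child test, modelled on the quoted snippet. */
		if (child->p_stat == SIM_SSTOP && child->p_waited == 0 &&
		    (child->p_traced || (options & SIM_WUNTRACED))) {
			child->p_waited = 1;
			parent->p_nstopchild--;
			return child;
		}

		/* Early exit: the counter claims nothing is pending. */
		if (parent->p_nstopchild == 0)
			return NULL;
	}
	return NULL;
}
```

In this model, with p_nstopchild == 0 and options == 0, the walk stops at
the first stopped child and never reaches a zombie further down the list.
Whether the real kernel behaves the same way depends on details the model
leaves out.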


Questions:

1. Is it correct for init's p_nstopchild to be zero when it has several
   children whose p_stat is SSTOP?

2. Is the above code in init correct?  Should we really be leaving the
   loop when there are more children to examine?






Re: Killing a zombie process?

2015-10-01 Thread Paul Goyette

For now, I took a quick look into the zombie's struct proc.

p_exitsig = 0x14   = SIGCHLD
p_flag    = 0x0
p_sflag   = 0x2000 = PS_WEXIT
p_slflag  = 0x0
p_lflag   = 0x2    = PL_CONTROLT
p_stflag  = 0x0
p_stat    = 0x5    = SZOMB

p_trace_enabled = 0x0
p_pid     = 0x5280 = 21120 (the same value shown by ps)

I don't see anything unusual here.

I have attached the hex-dump in case anyone wants to look a little bit 
closer.




fe81f578ba70:   0   0   806be7000   
0   fbb7a840fe81
fe81f578ba90:   801 0   0   0   0   
0   f578baa0fe81
fe81f578bab0:   804d542e0   0   
f578bab8fe81804e7f9a
fe81f578bad0:   ef0106c0fe81f46f680 fe810   
0   e00b5700fe81
fe81f578baf0:   55fe8de8fe8180722de0
3be9b258fe800   0
fe81f578bb10:   0   7f7f0   0   
fff014  0
fe81f578bb30:   20000   2   0   
f683000552800   0
fe81f578bb50:   eab655b0fe81f45ecd0 fe81
1de352a0fe82fc805108fe81
fe81f578bb70:   0   0   21ccb560fe800   
0   1   1
fe81f578bb90:   0   0   0   1   0   
0   0   0
fe81f578bbb0:   0   0   0   0   0   
900045b20
fe81f578bbd0:   0   0   0   0   0   
0   0   0
fe81f578bbf0:   0   0   0   0   0   
0   0   0
fe81f578bc10:   0   0   e6023190fe81
806b63000   0
fe81f578bc30:   808be0e00   0   0   
0   0   0
fe81f578bc50:   f578bc48fe810   0   0   
0   0   0
fe81f578bc70:   1   0   0   0   0   
0   0   0
fe81f578bc90:   0   
0   0   0
fe81f578bcb0:   687314  656b0   0   0   
0   eab655b0fe81
fe81f578bcd0:   ffe07f7f0   1   0   
7f7f8012f010
fe81f578bcf0:   0   7f80f683b8e67f7f
deaddeadfe810   0
fe81f578bd10:   f578b558fe8124bf9400fe80
fff00   0
fe81f578bd30:   0   0   f578bd30fe81
8049abd40   0
fe81f578bd50:   f578bd48fe818049abd4
e50db0c0fe81f79e380 fe81
fe81f578bd70:   0   0   f48a4c48fe81
8072312080722de0
fe81f578bd90:   213f4228fe810   0   0   
7f7f0   0
fe81f578bdb0: 

Re: Killing a zombie process?

2015-10-01 Thread Paul Goyette

Still trying to track this down

A modified version of ps(1) shows that the process state is clearly 
LSZOMB and not LSDEAD.  Furthermore, "ps -s" doesn't show any LWP for 
the zombie process, so it would seem that process clean up has 
progressed relatively far.


I was able to use "ps axl -O paddr" to get the address of the process's 
struct proc, but I don't seem to be able to examine it with gdb.


#  ps axl -O paddr | grep Z
 UID   PID PADDR         PPID CPU PRI NI  VSZ RSS WCHAN STAT TTY      TIME COMMAND
1000 25032 fe810f45ea40  2214   0  43  0 2240  48 -     R+   pts/1 0:00.00 grep Z
   0 21120 fe81f578ba70     1   0   0  0    0   0 -     Z    pts/2 0:00.00 (sh)
#  gdb /dev/mem
GNU gdb (GDB) 7.9.1
...
This GDB was configured as "x86_64--netbsd".
...
(gdb) symbol-file /netbsd.gdb
Reading symbols from /netbsd.gdb...done.
(gdb) print (struct proc *) 0xfe81f578ba70
warning: value truncated
$1 = (struct proc *) 0xfe81f578ba70
(gdb)


Any clue on how to properly access kernel memory here, without having 
the address truncated to 32 bits?


TIA




Re: Killing a zombie process?

2015-09-30 Thread Brian Buhrow
Hello.  Did you mistype, did I misread or did you really mean to say
that the parent pid (ppid) is 0 on the offending zombie process?  that
could be a clue.  The ppid should be 1, not 0.  I wonder how, if that is
the case, the ppid of 0 gets assigned instead of 1?
-thanks
-Brian

On Sep 30,  3:55pm, Paul Goyette wrote:
} Subject: Re: Killing a zombie process?
} On Wed, 30 Sep 2015, Paul Goyette wrote:
} 
} >> # kill -HUP 1
} >> # ps axl | grep ' Z '
} >>   0 27237 1 0   0  0   0  0 -   Zpts/2- 0:00.00 
} >> (sh)
} >
} > Well, it happened again!
} >
} > I rebooted earlier today, and then deinstalled and rebuilt about 40
} > packages within the pkgsrc/sysutils/mksandbox environment (all with
} > MAKE_JOBS=3 enabled).  After all packages were rebuilt, I exit from
} > the sandbox and run ./sandbox/dismount and get the error
} >
} > umount: /sandbox/bin: Device busy
} >
} > Sure enough, there's a new Zombie process, and its parent seems to be
} > init  (PPID==0)
} >
} > # ps axl | grep ' Z '
} >0 23848 28120  85  0   4360164 pipe_rd R+   pts/2  0:00.00 
} > grep  Z
} >0 2543910   0  0  0  0 -   Zpts/2  0:00.00 
} > (sh)
} >
} > HUPing init still doesn't help.
} >
} > So, I'm pretty sure that there's a bug somewhere, but haven't a clue
} > on where  to start looking.
} 
} Interestingly, if I shutdown to single-user mode, the zombie process 
} gets reaped and disappears!
} 
} So there must be some difference in how init(8) waits during normal 
} operation and how it waits during the transition to single-user.
} 
} 
} 
} +--+--+-+
} | Paul Goyette | PGP Key fingerprint: | E-mail addresses:   |
} | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com|
} | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
} +--+--+-+
>-- End of excerpt from Paul Goyette




Re: Killing a zombie process?

2015-09-30 Thread Paul Goyette

On Wed, 30 Sep 2015, Paul Goyette wrote:


# kill -HUP 1
# ps axl | grep ' Z '
   0 27237     1   0   0  0     0     0 -     Z    pts/2- 0:00.00 (sh)


Well, it happened again!

I rebooted earlier today, and then deinstalled and rebuilt about 40
packages within the pkgsrc/sysutils/mksandbox environment (all with
MAKE_JOBS=3 enabled).  After all packages were rebuilt, I exit from
the sandbox and run ./sandbox/dismount and get the error

umount: /sandbox/bin: Device busy

Sure enough, there's a new Zombie process, and its parent seems to be
init  (PPID==0)

# ps axl | grep ' Z '
	   0 23848 28120  85  0   4360164 pipe_rd R+   pts/2  0:00.00 
grep  Z
	   0 2543910   0  0  0  0 -   Zpts/2  0:00.00 
(sh)


HUPing init still doesn't help.

So, I'm pretty sure that there's a bug somewhere, but haven't a clue
on where  to start looking.


Interestingly, if I shutdown to single-user mode, the zombie process 
gets reaped and disappears!


So there must be some difference in how init(8) waits during normal 
operation and how it waits during the transition to single-user.






Re: Killing a zombie process?

2015-09-30 Thread Robert Elz
Date:Wed, 30 Sep 2015 15:55:04 +0800 (PHT)
From:Paul Goyette 
Message-ID:  

  | So there must be some difference in how init(8) waits during normal 
  | operation and how it waits during the transition to single-user.

Either that (which isn't really all that likely I'd guess) or perhaps
the process is not yet linked to init, so can't be waited upon.   It
needs to be on init's child queue for wait to find it, regardless of
what the ppid has been set to.

I think I'd be checking out the sequence in the sys_exit() code, to see if
there's anything that happens, or could happen, between setting the ppid to 1
and linking the process onto process 1's child list that could perhaps block
and cause the zombie to just sit there (for this, once the process status
is Z, you can't really trust some of the other ps output, pid and ppid
should be correct, but wchan is unlikely to have any meaning).

kre



Re: Killing a zombie process?

2015-09-30 Thread Robert Elz
Date:Wed, 30 Sep 2015 18:29:20 +0800 (PHT)
From:Paul Goyette 
Message-ID:  

  | Well, a quick read through sbin/init.c shows that sometimes it waits 
  | with WNOHANG and sometimes it doesn't.

It is more that init reaps lots of zombie processes, missing just one of
them, occasionally, seems unlikely at best, whatever flags it gives wait().

Far more likely (IMO) is that the process in question is special somehow,
and the most likely special that would cause wait() to fail to see it, is
if the process isn't on init's child process list.   There might be
other possibilities, if the kernel wait code sometimes ignores zombie
processes for some other reason (some other resource still owned, or whatever).

  | Well, for the previous occurrence, I waited many hours, and the zombie 
  | was still there.  (It might even have been as much as a couple of days.)

Of course, it won't be time based where your shutdown just happened to
occur at the magic interval ... rather, shutdown will be causing some
other condition to occur (or be removed) which then allows the zombie
process to complete its transition into full zombiehood (???) and for
init to then clean it.

  | If I get really brave, I might even use gdb to attach to init(8) and see 
  | which of the several waitpid() calls is active.

I think I'd start with the proc structure of the zombie itself, and see
if there's anything unusual about it, see if all the process's resources
(like its kernel stack) have truly been freed already, and if not, just where
that process is sitting.   Since the zombie sits there essentially
forever (it seems) it ought to be fairly easy to check this just using
gdb on /dev/kmem without interrupting normal operations at all (ie: risk free).

On the other hand, checking init's child queue that way would be hard, as it
is in a constant state of churn.

kre



Re: Killing a zombie process?

2015-09-30 Thread Paul Goyette

On Wed, 30 Sep 2015, Robert Elz wrote:


>    Date:Wed, 30 Sep 2015 15:55:04 +0800 (PHT)
>    From:Paul Goyette 
>    Message-ID:  
>
>  | So there must be some difference in how init(8) waits during normal
>  | operation and how it waits during the transition to single-user.
>
> Either that (which isn't really all that likely I'd guess) ...


Well, a quick read through sbin/init.c shows that sometimes it waits 
with WNOHANG and sometimes it doesn't.  I haven't figured out the actual 
code-flow yet, so I can't tell if this accounts for the steady-state vs 
transition-to-single-user difference or not.



> ... or perhaps
> the process is not yet linked to init, so can't be waited upon.   It
> needs to be on init's child queue for wait to find it, regardless of
> what the ppid has been set to.


Well, for the previous occurrence, I waited many hours, and the zombie 
was still there.  (It might even have been as much as a couple of days.) 
In today's event, the 'shutdown' transition was run less than one hour 
after the first notice, and at _that_ time the zombie was reaped.  It 
doesn't seem logical that the ppid gets set, but it gets enqueued only 
after starting a shutdown.



> I think I'd be checking out the sequence in the sys_exit() code, to see
> if there's anything that happens, or could happen, between setting
> the ppid to 1 and linking the process onto process 1's child list that
> could perhaps block and cause the zombie to just sit there (for this,
> once the process status is Z, you can't really trust some of the other
> ps output, pid and ppid should be correct, but wchan is unlikely to have
> any meaning).


Yeah, I'll have a look at the sys_exit() code and see what I can find. 
If I get really brave, I might even use gdb to attach to init(8) and see 
which of the several waitpid() calls is active.  (I'd prefer to do this 
in a qemu VM, but then I'd need to reproduce the entire environment 
inside the VM.)






Re: Killing a zombie process?

2015-09-30 Thread Paul Goyette

On Wed, 30 Sep 2015, Brian Buhrow wrote:


> Hello.  Did you mistype, did I misread or did you really mean to say
> that the parent pid (ppid) is 0 on the offending zombie process?  that
> could be a clue.  The ppid should be 1, not 0.  I wonder how, if that is
> the case, the ppid of 0 gets assigned instead of 1?


it's a typo.  The parent is init, PPID==1

 UID   PID  PPID CPU PRI NI   VSZ   RSS WCHAN STAT TTY       TIME COMMAND
   0 27237     1   0   0  0     0     0 -     Z    pts/2- 0:00.00 (sh)
            ^^^^




-thanks
-Brian

On Sep 30,  3:55pm, Paul Goyette wrote:
} Subject: Re: Killing a zombie process?
} On Wed, 30 Sep 2015, Paul Goyette wrote:
}
} >> # kill -HUP 1
} >> # ps axl | grep ' Z '
} >>   0 27237 1 0   0  0   0  0 -   Zpts/2- 0:00.00
} >> (sh)
} >
} > Well, it happened again!
} >
} > I rebooted earlier today, and then deinstalled and rebuilt about 40
} > packages within the pkgsrc/sysutils/mksandbox environment (all with
} > MAKE_JOBS=3 enabled).  After all packages were rebuilt, I exit from
} > the sandbox and run ./sandbox/dismount and get the error
} >
} >  umount: /sandbox/bin: Device busy
} >
} > Sure enough, there's a new Zombie process, and its parent seems to be
} > init  (PPID==0)
} >
} >  # ps axl | grep ' Z '
} > 0 23848 28120  85  0   4360164 pipe_rd R+   pts/2  0:00.00
} > grep  Z
} > 0 2543910   0  0  0  0 -   Zpts/2  0:00.00
} > (sh)
} >
} > HUPing init still doesn't help.
} >
} > So, I'm pretty sure that there's a bug somewhere, but haven't a clue
} > on where  to start looking.
}
} Interestingly, if I shutdown to single-user mode, the zombie process
} gets reaped and disappears!
}
} So there must be some difference in how init(8) waits during normal
} operation and how it waits during the transition to single-user.
}
}
}
} +--+--+-+
} | Paul Goyette | PGP Key fingerprint: | E-mail addresses:   |
} | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com|
} | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
} +--+--+-+

-- End of excerpt from Paul Goyette








Killing a zombie process?

2015-09-24 Thread Paul Goyette
I'm not sure how I got to this point (but see high-level steps below). 
I have this zombie process:


root 27237  0.0  0.0   0  0 pts/2- Z  - 0:00.00 (sh)

Various web resources say "kill the parent" and the zombie child will 
die, too.  But that's probably not a good idea here, since the parent is 
(or at least, appears to be) init (pid==1).


I checked for other potential parents (ie, any process with pts/2 for 
its TTY), and found two shell processes (one was my "login" shell on 
that terminal, and the other was the result of a "su" command).  I 
logged out of both processes, but the zombie remained.


This is the second time this has happened, and both times were when I 
was using pkgsrc's mksandbox to rebuild something.  The sandbox is 
"almost" standard, created with this command:


# mksandbox --src=/build/netbsd-local/src   \
--xsrc=/build/netbsd-local/xsrc \
--rwdirs=/tmp   \
/sandbox

(I added the rwdirs=/tmp so that /sandbox/tmp would be a memory-based 
tmpfs filesystem.)


I wouldn't usually worry too much about the zombie, but it's running 
/bin/sh _from_within_the_sandbox_ and therefore its image/text file owns 
a reference to /sandbox/bin/sh and this reference prevents me from 
properly unmounting the sandbox.


I suppose I could just manually run "umount -f" but I just hate forcing 
an unmount of an in-use file-system.  :)


Suggestions?




Re: Killing a zombie process?

2015-09-24 Thread Gary Duzan
In Message ,
   Paul Goyette wrote:

=>I'm not sure how I got to this point (but see high-level steps below). 
=>I have this zombie process:
=>
=>root 27237  0.0  0.0   0  0 pts/2- Z  - 0:00.00 (sh)
=>
=>Various web resources say "kill the parent" and the zombie child will 
=>die, too.  But that's probably not a good idea here, since the parent is 
=>(or at least, appears to be) init (pid==1).

   Can you confirm with "ps axl"?

=>I checked for other potential parents (ie, any process with pts/2 for 
=>its TTY), and found two shell processes (one was my "login" shell on 
=>that terminal, and the other was the result of a "su" command).  I 
=>logged out of both processes, but the zombie remained.
=>
=>This is the second time this has happened, and both times were when I 
=>was using pkgsrc's mksandbox to rebuild something.  The sandbox is 
=>"almost" standard, created with this command:
=>
=>  # mksandbox --src=/build/netbsd-local/src   \
=>  --xsrc=/build/netbsd-local/xsrc \
=>  --rwdirs=/tmp   \
=>  /sandbox
=>
=>(I added the rwdirs=/tmp so that /sandbox/tmp would be a memory-based 
=>tmpfs filesystem.)
=>
=>I wouldn't usually worry too much about the zombie, but it's running 
=>/bin/sh _from_within_the_sandbox_ and therefore its image/text file owns 
=>a reference to /sandbox/bin/sh and this reference prevents me from 
=>properly unmounting the sandbox.
=>
=>I suppose I could just manually run "umount -f" but I just hate forcing 
=>an unmount of an in-use file-system.  :)
=>
=>Suggestions?

   If init is really its parent, check its "ps axl" output and
check its WCHAN. If it isn't "wait", maybe run "ktruss -p 1" to
get an idea of what it is doing instead of wait*() calls.

Gary Duzan





Re: Killing a zombie process?

2015-09-24 Thread Paul Goyette

On Thu, 24 Sep 2015, Gary Duzan wrote:


> In Message ,
>   Paul Goyette wrote:
>
> =>I'm not sure how I got to this point (but see high-level steps below).
> =>I have this zombie process:
> =>
> =>root 27237  0.0  0.0   0  0 pts/2- Z  - 0:00.00 (sh)
> =>
> =>Various web resources say "kill the parent" and the zombie child will
> =>die, too.  But that's probably not a good idea here, since the parent is
> =>(or at least, appears to be) init (pid==1).
>
>   Can you confirm with "ps axl"?


UID   PID  PPID CPU PRI NI   VSZ   RSS WCHAN STAT TTY       TIME COMMAND
  0     0     0   0   0  0     0 15044 -     OKl  ?      1:49.02 [syste
  0     1     0   0  85  0 13092  1336 wait  Is   ?      0:00.59 init
...
  0 27237     1   0   0  0     0     0 -     Z    pts/2- 0:00.00 (sh)


Yup, my zombie's parent PPID==1




>   If init is really its parent, check its "ps axl" output and
> check its WCHAN. If it isn't "wait", maybe run "ktruss -p 1" to
> get an idea of what it is doing instead of wait*() calls.


See ps output above;  init's WCHAN==wait

So no clue on why it's not getting around to reaping child 27237.
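
The check discussed here (find the zombie, look up its parent, read the
parent's WCHAN) can be scripted. The following is only a sketch in Python,
assuming the "ps axl" column layout shown in this thread; the helper names
parse_ps_axl and zombie_parents are made up for illustration, and the
sample text is the ps output quoted above.

```python
def parse_ps_axl(text):
    """Parse "ps axl"-style output into a list of dicts keyed by column name.

    Assumes a whitespace-separated header line followed by one row per
    process, with COMMAND as the last column (as in the thread's output).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        # Split into at most len(header) fields so COMMAND may contain spaces.
        fields = ln.split(None, len(header) - 1)
        if len(fields) != len(header):
            continue  # skip elided or malformed lines such as "..."
        rows.append(dict(zip(header, fields)))
    return rows


def zombie_parents(rows):
    """Return (zombie pid, parent pid, parent's WCHAN) for each zombie row."""
    by_pid = {r["PID"]: r for r in rows}
    report = []
    for r in rows:
        if r["STAT"].startswith("Z"):
            parent = by_pid.get(r["PPID"])
            wchan = parent["WCHAN"] if parent else None
            report.append((int(r["PID"]), int(r["PPID"]), wchan))
    return report


# Sample taken from the ps output quoted in this thread.
SAMPLE = """\
UID   PID  PPID CPU PRI NI   VSZ   RSS WCHAN STAT TTY       TIME COMMAND
  0     0     0   0   0  0     0 15044 -     OKl  ?      1:49.02 [syste
  0     1     0   0  85  0 13092  1336 wait  Is   ?      0:00.59 init
  0 27237     1   0   0  0     0     0 -     Z    pts/2- 0:00.00 (sh)
"""

print(zombie_parents(parse_ps_axl(SAMPLE)))  # [(27237, 1, 'wait')]
```

On a live system the text would come from running "ps axl" instead of
SAMPLE; other platforms may print different columns, so the header-driven
parsing is the only part that should be relied on.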




Re: Killing a zombie process?

2015-09-24 Thread Paul Goyette

On Thu, 24 Sep 2015, Greg Troxel wrote:



> Paul Goyette  writes:
>
>> On Thu, 24 Sep 2015, Gary Duzan wrote:
>> Yup, my zombie's parent PPID==1
>>
>>>   If init is really its parent, check its "ps axl" output and
>>> check its WCHAN. If it isn't "wait", maybe run "ktruss -p 1" to
>>> get an idea of what it is doing instead of wait*() calls.
>>
>> See ps output above;  init's WCHAN==wait
>>
>> So no clue on why it's not getting around to reaping child 27237.
>
> I would try sending init a HUP, which should rescan /etc/ttys and not
> really do anything.  But it will then call wait(2) again, and if there
> was a glitch where init was already in wait and the offending process's
> transition to zombie and ppid==1 didn't cause a wakeup, then it may
> resolve.


No luck...  I HUPed init, but the zombie is still there.

# kill -HUP 1
# ps axl | grep ' Z '
  0 27237     1   0   0  0     0     0 -     Z    pts/2- 0:00.00 (sh)






Re: Killing a zombie process?

2015-09-24 Thread Greg Troxel

Paul Goyette  writes:

> On Thu, 24 Sep 2015, Gary Duzan wrote:
> Yup, my zombie's parent PPID==1
>
>>   If init is really its parent, check its "ps axl" output and
>> check its WCHAN. If it isn't "wait", maybe run "ktruss -p 1" to
>> get an idea of what it is doing instead of wait*() calls.
>
> See ps output above;  init's WCHAN==wait
>
> So no clue on why it's not getting around to reaping child 27237.

I would try sending init a HUP, which should rescan /etc/ttys and not
really do anything.  But it will then call wait(2) again, and if there
was a glitch where init was already in wait and the offending process's
transition to zombie and ppid==1 didn't cause a wakeup, then it may
resolve.
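
For reference, the reaping that a wait(2) call performs can be observed
from userland. This is a minimal Python sketch, not init's code: the
function name zombie_demo, the 0.2 second pause, and the exit code 7 are
all arbitrary choices for the example. It shows that an exited child's
process-table entry stays around (as a zombie) until the parent waits for
it, and disappears afterwards.

```python
import os
import time


def zombie_demo():
    """Fork a child that exits at once, then reap it with waitpid()."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with an arbitrary status

    # Parent: give the child a moment to exit. Until we wait for it,
    # its entry remains in the process table as a zombie.
    time.sleep(0.2)

    # Signal 0 delivers nothing; it only checks that the PID still exists.
    os.kill(pid, 0)
    existed_before_wait = True

    # Reap the child; this is what removes the zombie entry.
    reaped_pid, status = os.waitpid(pid, 0)

    try:
        os.kill(pid, 0)
        exists_after_wait = True
    except ProcessLookupError:
        exists_after_wait = False

    return {
        "pid": pid,
        "reaped_pid": reaped_pid,
        "exit_code": os.WEXITSTATUS(status),
        "existed_before_wait": existed_before_wait,
        "exists_after_wait": exists_after_wait,
    }


if __name__ == "__main__":
    print(zombie_demo())
```

The zombie in this thread is in the state the sketch is in just before
os.waitpid(): the child has exited, but its parent has not yet collected
the status.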

