daily CVS update output

2015-10-04 Thread NetBSD source update

Updating src tree:
P src/external/cddl/osnet/dist/lib/libdtrace/common/dt_module.c
P src/share/man/man9/module.9
P src/share/man/man9/pci.9
P src/sys/arch/evbarm/conf/RPI
P src/sys/arch/sparc/sparc/autoconf.c
P src/sys/arch/sparc/sparc/db_disasm.c
P src/sys/arch/sparc/sparc/db_interface.c
P src/sys/arch/sparc/sparc/msiiep.c
P src/sys/arch/sparc/sparc/pmap.c
P src/sys/arch/sparc/sparc/syscall.c
P src/sys/arch/sparc/sparc/trap.c
P src/sys/arch/sparc64/sparc64/db_disasm.c
P src/sys/arch/x86/x86/cpu_ucode_intel.c
P src/sys/dev/gpio/gpiobutton.c
P src/sys/dev/sbus/stp4020.c
P src/tests/usr.bin/config/t_config.sh

Updating xsrc tree:


Killing core files:

Running the SUP scanner:
SUP Scan for current starting at Mon Oct  5 03:06:02 2015
SUP Scan for current completed at Mon Oct  5 03:06:57 2015
SUP Scan for mirror starting at Mon Oct  5 03:06:57 2015
SUP Scan for mirror completed at Mon Oct  5 03:09:44 2015



Updating release-5 src tree (netbsd-5):

Updating release-5 xsrc tree (netbsd-5):

Running the SUP scanner:
SUP Scan for release-5 starting at Mon Oct  5 03:16:09 2015
SUP Scan for release-5 completed at Mon Oct  5 03:16:17 2015



Updating release-6 src tree (netbsd-6):

Updating release-6 xsrc tree (netbsd-6):

Running the SUP scanner:
SUP Scan for release-6 starting at Mon Oct  5 03:33:18 2015
SUP Scan for release-6 completed at Mon Oct  5 03:33:29 2015




Updating file list:
-rw-rw-r--  1 srcmastr  netbsd  52957159 Oct  5 03:40 ls-lRA.gz


Re: Problems with gdb?

2015-10-04 Thread Robert Elz
Date: Sun, 04 Oct 2015 18:03:26 +0700
From: Robert Elz
Message-ID:  <10734.1443956...@andromeda.noi.kre.to>

  | Paul's problem is with the image file (the core) - or more likely, with
  | gdb (since crash(8) works).

Ignore me (aside from the part about copying /netbsd to /var/crash being
broken) I had not seen Paul's later message when I replied...

kre



Re: Killing a zombie process?

2015-10-04 Thread Paul Goyette

On Sun, 4 Oct 2015, Robert Elz wrote:


> Date: Sun, 4 Oct 2015 17:25:21 +0800 (PHT)
> From: Paul Goyette
> Message-ID:
>
>  | I'm pretty much convinced that the p_nstopchild accounting is screwed up
>  | somewhere.
>
> I think I agree.
>
>  | I'm planning on adding the following code in "optimization"
>  | in kern_exit so I can catch it as soon as it happens.
>
> Sooner, but unfortunately, most probably not soon enough.
>
> It is most likely some locking/race condition with multiple processes
> dying at the same time (approximately) that is causing some of the
> increments to be lost.   Making them all use atomic ops, instead of just ++
> might fix the problem, at the cost of never discovering where the issue
> actually occurs - there should be locks around all manipulations of
> this stuff, possibly one of them is missing or misplaced.


Yeah, I think that there's a basic accounting problem somewhere, and 
with an extreme load it is more likely for the SSTOPed process to get 
inserted in the p_children/p_sibling list before the SZOMB process can 
get reaped.  Once the SSTOPed process gets to the front of the line (with 
the parent's p_nstopchild count zero), the SZOMB process won't ever get 
processed.  My patch will simply validate this theory.


(BTW, the patch is actually wrong, as it would also panic in the case 
where the wait was for a specific pid.  I've modified it in my new 
kernel - not yet tested.)



> It is unlikely to be in the wait processing (at least not this one) as
> there's just one process doing the waiting, there would be no contention
> for the accesses here (it could be a combination of the two though,
> wait() happening at the same instant a process is dying).


See above.


> I'm also puzzled by your observations of forked init processes having
> exited - after rc is finished, init generally only forks when one of the
> console/terminal sessions ends, and a new getty needs to be started.
> On most modern systems, that's a very rare event - though if you use
> the console (ctl-alt-Fn or whatever it is) switching, and login and out
> of those (virtual) terminals, it would happen.  Is there anything like
> that in your environment?


I do occasionally switch to another wsdisplay screen (away from the X 
one), but not frequently.  I definitely do a switch before I use 
Ctrl/Alt/Esc to get into ddb.


I'm wondering if some (most? all?) of the SSTOPd processes I see are a 
result of entering ddb and/or triggering the reboot?  Doesn't ddb need 
to stop whatever is running on "the other CPU cores" ?




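+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+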


daily CVS update output

2015-10-04 Thread NetBSD source update

Updating src tree:
P src/libexec/lfs_cleanerd/lfs_cleanerd.c
P src/sbin/fsck_lfs/extern.h
P src/sbin/fsck_lfs/fsck.h
P src/sbin/fsck_lfs/lfs.c
P src/sbin/fsck_lfs/pass1.c
P src/sbin/fsck_lfs/pass6.c
P src/sbin/fsck_lfs/segwrite.c
P src/sbin/fsck_lfs/setup.c
P src/sys/dev/pci/pci_subr.c
P src/sys/ufs/lfs/lfs.h
P src/sys/ufs/lfs/lfs_accessors.h
P src/sys/ufs/lfs/lfs_bio.c
P src/sys/ufs/lfs/lfs_rfw.c
P src/sys/ufs/lfs/lfs_segment.c
P src/sys/ufs/lfs/lfs_subr.c
P src/usr.sbin/dumplfs/dumplfs.c

Updating xsrc tree:


Killing core files:

Running the SUP scanner:
SUP Scan for current starting at Sun Oct  4 04:09:11 2015
SUP Scan for current completed at Sun Oct  4 04:43:32 2015
SUP Scan for mirror starting at Sun Oct  4 04:43:32 2015
SUP Scan for mirror completed at Sun Oct  4 06:11:25 2015




Updating file list:
-rw-rw-r--  1 srcmastr  netbsd  53007056 Oct  4 09:24 ls-lRA.gz


Re: Killing a zombie process?

2015-10-04 Thread Paul Goyette
I'm pretty much convinced that the p_nstopchild accounting is screwed up 
somewhere.  I'm planning on adding the following code in "optimization" 
in kern_exit so I can catch it as soon as it happens.


Basically, if the optimization would cause us to stop looking for a 
process to report, this hack/patch will just scan the rest of the 
sibling list.  If it finds a zombie that should be reported, it will 
panic, and I'll have pointers to both the zombie and the process at 
which the optimization occurred.


Comments?


Index: kern_exit.c
===
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.245
diff -u -p -r1.245 kern_exit.c
--- kern_exit.c 2 Oct 2015 16:54:15 -   1.245
+++ kern_exit.c 4 Oct 2015 09:15:00 -
@@ -788,6 +788,14 @@ find_stopped_child(struct proc *parent,
                 break;
             }
             if (parent->p_nstopchild == 0 || child->p_pid == pid) {
+/* XXX */
+                struct proc *nxtchild = child;
+                while ((nxtchild = LIST_NEXT(nxtchild, p_sibling)) != NULL)
+                    if (nxtchild->p_stat == SZOMB)
+                        panic("Zombie %p not reaped - "
+                            "scan stopped at proc %p",
+                            nxtchild, child);
+/* XXX */
                 child = NULL;
                 break;
             }



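+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+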


Re: Problems with gdb?

2015-10-04 Thread Robert Elz
Date: Sun, 4 Oct 2015 10:26:10 +0300
From: Andreas Gustafsson
Message-ID:  <22032.54418.500901.577...@guava.gson.org>

  | Paul Goyette wrote:
  | > In attempts to debug another problem (see the thread about "killing 
  | > zombies"), I've twice forced crash dumps from ddb.  Once with the 'sync' 
  | > command, and once with 'reboot 0x104'.
  | [...]
  | > Yet, gdb fails to process these files:
  | 
  | PR 48915?

No, that one (I believe) relates to the /var/crash/netbsd.N files
being a mess - something goes horribly wrong in the way they're created
(I had observed that one as well, though I wasn't aware there was a PR
about it.)

That one is a nuisance bug, as one can always just use the /netbsd file
that had been booted when the system crashed instead.

Paul's problem is with the image file (the core) - or more likely, with
gdb (since crash(8) works).

kre



Re: Killing a zombie process?

2015-10-04 Thread Robert Elz
Date: Sun, 4 Oct 2015 17:25:21 +0800 (PHT)
From: Paul Goyette
Message-ID:  

  | I'm pretty much convinced that the p_nstopchild accounting is screwed up 
  | somewhere.

I think I agree.

  | I'm planning on adding the following code in "optimization" 
  | in kern_exit so I can catch it as soon as it happens.

Sooner, but unfortunately, most probably not soon enough.

It is most likely some locking/race condition with multiple processes
dying at the same time (approximately) that is causing some of the
increments to be lost.   Making them all use atomic ops, instead of just ++
might fix the problem, at the cost of never discovering where the issue
actually occurs - there should be locks around all manipulations of
this stuff, possibly one of them is missing or misplaced.

It is unlikely to be in the wait processing (at least not this one) as
there's just one process doing the waiting, there would be no contention
for the accesses here (it could be a combination of the two though,
wait() happening at the same instant a process is dying).

I'm also puzzled by your observations of forked init processes having
exited - after rc is finished, init generally only forks when one of the
console/terminal sessions ends, and a new getty needs to be started.
On most modern systems, that's a very rare event - though if you use
the console (ctl-alt-Fn or whatever it is) switching, and login and out
of those (virtual) terminals, it would happen.  Is there anything like
that in your environment?

kre



Re: Killing a zombie process?

2015-10-04 Thread Paul Goyette

On Sun, 4 Oct 2015, Paul Goyette wrote:


>  | 1. Is it correct for init's p_nstopchild to be zero when it has several
>  | children whose p_state is SSTOP?
>
> Depends whether those children have previously been waited for or not.
> Stopped children don't go away when they're waited for, so there needs
> to be something to prevent wait() returning the same stopped child
> over and over again.   That's p_waited ... so you need to check that
> value of the stopped children, if it is 0, then something is broken.
> If it is 1 (for all of them) then they're irrelevant, and matter not
> at all.



Here's another instance of the problem.  (Note that I'm limping along 
with crash(8) here since gdb isn't cooperating at the moment.)


crash> show proc 1
init: pid 1 proc fe810f46ecd0 vmspace/map fe810f483e60 flags 4001
  lwp 1 fe810f476a60 pcb fe810f464000
stat 2 flags 802 cpu 0 pri 43
crash> x/x 0xfe810f46ecd0+0x130
fe810f46ee00:   0   p_nstopchild == 0
crash> x/x 0xfe810f46ecd0+0x100,2
fe810f46edd0:   7b5f5800 fe80   p_children listhead

Looking at the first child...

crash> x/x 0xfe807b5f5800+0xd0
fe807b5f58d0:   4   p_stat == SSTOP
crash>
fe807b5f58d4:   6f68   p_pid
crash> show proc 0x6f68
init: pid 28520 proc fe807b5f5800 vmspace/map fe807e7be480 flags 0
  lwp 1 fe811e636300 pcb fe81aae19000
stat 2 flags 802 cpu 3 pri 43
crash> x/x 0xfe807b5f5800+0x134
fe807b5f5934:   0   p_waited == 0
crash> x/x 0xfe807b5f5800+0xf0,2
fe807b5f58f0:   f46e520 fe81   p_sibling.le_next

So, the first child of init appears to be another instance of init, and 
its state is SSTOP.  It has not been waited for, yet its parent (the 
"real" init, pid=1) has a zero count for p_nstopchild.



This problem is easily reproduced, but only under heavy-load conditions. 
On an amd64 (CPU = Intel i5-4460 @ 3.20GHz) 7.99.21 I've been running a 
'build.sh -j3 release' in parallel with a series of pkgsrc builds 
running with MAKE_JOBS=3;  it takes from 30 to 60 minutes of this before 
the Zombie appears. (The pkgsrc builds are running in chroot created by 
pkgsrc/sysutils/mksandbox.)



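+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+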


Re: Problems with gdb?

2015-10-04 Thread Paul Goyette

On Sun, 4 Oct 2015, Andreas Gustafsson wrote:


> Paul Goyette wrote:
>> In attempts to debug another problem (see the thread about "killing
>> zombies"), I've twice forced crash dumps from ddb.  Once with the 'sync'
>> command, and once with 'reboot 0x104'.
> [...]
>> Yet, gdb fails to process these files:
>
> PR 48915?


Yup, looks like that's the one!

I can process the dump file successfully with

#  gdb /netbsd.gdb
GNU gdb (GDB) 7.9.1
Copyright (C) 2015 Free Software Foundation, Inc.

Reading symbols from /netbsd.gdb...done.
(gdb) target kvm netbsd.4.core
0x801196a5 in cpu_reboot (howto=howto@entry=256,
bootstr=bootstr@entry=0x0)
at /build/netbsd-local/src/sys/arch/amd64/amd64/machdep.c:671
671 dumpsys();
(gdb)



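+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+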


Re: Problems with gdb?

2015-10-04 Thread Andreas Gustafsson
Paul Goyette wrote:
> In attempts to debug another problem (see the thread about "killing 
> zombies"), I've twice forced crash dumps from ddb.  Once with the 'sync' 
> command, and once with 'reboot 0x104'.
[...]
> Yet, gdb fails to process these files:

PR 48915?
-- 
Andreas Gustafsson, g...@gson.org


Re: Killing a zombie process?

2015-10-04 Thread Robert Elz
Date: Sun, 4 Oct 2015 20:52:43 +0800 (PHT)
From: Paul Goyette
Message-ID:  

  | I do occasionally switch to another wsdisplay screen (away from the X 
  | one), but not frequently.  I definitely do a switch before I use 
  | Ctrl/Alt/Esc to get into ddb.

OK, that could explain the forked init.

  | I'm wondering if some (most? all?) of the SSTOPd processes I see are a 
  | result of entering ddb and/or triggering the reboot?  Doesn't ddb need 
  | to stop whatever is running on "the other CPU cores" ?

No, not that kind of stop.

kre

ps: you might want to try fixing PR 50298 (that I just submitted) and see
if that makes a difference - I think the chances are about one in infinity,
but ...



Re: pkgsrc-2015Q3 released

2015-10-04 Thread Rhialto
On Wed 30 Sep 2015 at 10:29:16 -0400, Greg Troxel wrote:
> Basically yes.  However, you may want to do a final update of the tree
> from sourceforge and verify you have no uncommitted changes that you
> want to keep.  (If so, you will have to manage them manually.)

which currently gives errors about "cannot close CVS/Entries" and
"No space left on device"... precisely the sort of reasons we moved away
from there of course.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'

