On Thu, Feb 24, 2011 at 09:29:16AM +0900, YONETANI Tomokazu wrote:
> On Mon, Feb 21, 2011 at 01:41:26PM +0900, YONETANI Tomokazu wrote:
> > Apparently a few hours of pbulk test can trigger this panic (this is
> > on x86_64 and the kernel is built from source as of 5347900e6).
> > As opposed to what KKASSERT claims, p->p_lock doesn't hold the non-zero
> > value in the frame 5:
> > 
>                       :
> >   (kgdb) fr 5
> >   #5  0xffffffff802989a1 in kern_wait (pid=<value optimized out>,
> >       status=0xffffffe05e997a74, options=1528637672, rusage=0x0,
> >       res=0xffffffe05e997b58) at /usr/src/sys/kern/kern_exit.c:901
> >   901                             KKASSERT(p->p_lock == 0);
> >   (kgdb) p p->p_lock
> >   $1 = 0
> 
> After poking here and the, I think this KKASSERT() can simply go away
> as proc_remove_zombie() will wait for p->p_lock to drop to zero anyway.
> 
>                       /*
>                        * Unlink the proc from its process group so that
>                        * the following operations won't lead to an
>                        * inconsistent state for processes running down
>                        * the zombie list.
>                        */
>                       KKASSERT(p->p_lock == 0);
>                       proc_remove_zombie(p);
>                       leavepgrp(p);

The following is what I have in my tree.  What it does is to hold proc_token
while waiting for p->p_lock to drop, just as done in proc_remove_zombie().
If I don't hold the proc_token as in the first chunk, I see the

  proc_remove_zombie: waiting for %p->p_lock to drop

message a few times every hour on the console.  I guess it may also
imply that the race is between a code which holds proc_token for
a long time but not p->p_token, like all*_scan().


diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
index 1e5a110..30fc0af 100644
--- a/sys/kern/kern_exit.c
+++ b/sys/kern/kern_exit.c
@@ -867,8 +867,10 @@ loop:
                         * put a hold on the process for short periods of
                         * time.
                         */
+                       lwkt_gettoken(&proc_token);
                        while (p->p_lock)
                                tsleep(p, 0, "reap3", hz);
+                       lwkt_reltoken(&proc_token);
 
                        /* Take care of our return values. */
                        *res = p->p_pid;
@@ -898,7 +900,6 @@ loop:
                         * inconsistent state for processes running down
                         * the zombie list.
                         */
-                       KKASSERT(p->p_lock == 0);
                        proc_remove_zombie(p);
                        leavepgrp(p);
 
diff --git a/sys/kern/kern_proc.c b/sys/kern/kern_proc.c
index cb067f0..048a619 100644
--- a/sys/kern/kern_proc.c
+++ b/sys/kern/kern_proc.c
@@ -661,6 +661,7 @@ proc_remove_zombie(struct proc *p)
 {
        lwkt_gettoken(&proc_token);
        while (p->p_lock) {
+               kprintf("%s: waiting for %p->p_lock to drop\n", __func__, p);
                tsleep(p, 0, "reap1", hz / 10);
        }
        LIST_REMOVE(p, p_list); /* off zombproc */
-- 
1.7.3.2

Reply via email to