Re: pool_grow hangs (Re: CVS commit: src/sys)

2017-12-29 Thread Tobias Nygren
On Fri, 29 Dec 2017 16:40:52 +
co...@sdf.org wrote:

> On Fri, Dec 29, 2017 at 03:41:42PM +0100, Tobias Nygren wrote:
> > The machine has survived for 30+ minutes where it previously hung after
> > just 20 seconds.
> 
> Do you mean "30+ minutes so far" or does it still hang somewhere? where?

No, I did not manage to break it yet.


Re: pool_grow hangs (Re: CVS commit: src/sys)

2017-12-29 Thread coypu
On Fri, Dec 29, 2017 at 03:41:42PM +0100, Tobias Nygren wrote:
> The machine has survived for 30+ minutes where it previously hung after
> just 20 seconds.

Do you mean "30+ minutes so far" or does it still hang somewhere? where?


Re: pool_grow hangs (Re: CVS commit: src/sys)

2017-12-29 Thread Tobias Nygren
> Meanwhile another process, waiting for the grower to finish, is
> spinning forever at 100% doing the mutex_exit/mutex_enter/ERESTART
> thing on the same pool. It looks to me like the grower never actually
> gets scheduled to run.

The attached diff works around the problem by not releasing the lock
during allocation in the PR_NOWAIT case. I don't know if doing it that
way could have any negative side effects?
The machine has survived for 30+ minutes where it previously hung after
just 20 seconds.

Kind regards,
-Tobias
Index: subr_pool.c
===
RCS file: /cvsroot/src/sys/kern/subr_pool.c,v
retrieving revision 1.219
diff -u -r1.219 subr_pool.c
--- subr_pool.c 16 Dec 2017 03:13:29 -  1.219
+++ subr_pool.c 29 Dec 2017 14:33:40 -
@@ -1091,7 +1091,9 @@
if ((flags & PR_WAITOK) == 0)
pp->pr_flags |= PR_GROWINGNOWAIT;
 
-   mutex_exit(&pp->pr_lock);
+   if (flags & PR_WAITOK)
+   mutex_exit(&pp->pr_lock);
+
char *cp = pool_allocator_alloc(pp, flags);
if (__predict_false(cp == NULL))
goto out;
@@ -1102,7 +1104,8 @@
goto out;
}
 
-   mutex_enter(&pp->pr_lock);
+   if (flags & PR_WAITOK)
+   mutex_enter(&pp->pr_lock);
pool_prime_page(pp, cp, ph);
pp->pr_npagealloc++;
KASSERT(pp->pr_flags & PR_GROWING);
@@ -1115,8 +1118,9 @@
return 0;
 out:
KASSERT(pp->pr_flags & PR_GROWING);
+   if (flags & PR_WAITOK)
+   mutex_enter(&pp->pr_lock);
pp->pr_flags &= ~(PR_GROWING|PR_GROWINGNOWAIT);
-   mutex_enter(&pp->pr_lock);
return ENOMEM;
 }
 


pool_grow hangs (Re: CVS commit: src/sys)

2017-12-29 Thread Tobias Nygren
On Sat, 16 Dec 2017 03:13:29 +
matthew green  wrote:

> Module Name:  src
> Committed By: mrg
> Date: Sat Dec 16 03:13:29 UTC 2017
> 
> Modified Files:
>   src/sys/kern: subr_pool.c
>   src/sys/sys: pool.h
> 
> Log Message:
> hopefully workaround the irregularly "fork fails in init" problem.
> 
> if a pool is growing, and the grower is PR_NOWAIT, mark this.
> if another caller wants to grow the pool and is also PR_NOWAIT,
> busy-wait for the original caller, which should either succeed
> or hard-fail fairly quickly.
> 
> implement the busy-wait by unlocking and relocking this pools
> mutex and returning ERESTART.  other methods (such as having
> the caller do this) were significantly more code and this hack
> is fairly localised.
> 
> ok chs@ riastradh@

Hi!

I have an easily reproducable system hang that I believe originates
from this change. It can be triggered by doing lots of block and
network i/o (like 3 multiple rsyncs) on a uniprocessor system
running under Linux KVM.

Basically what happens is that for unknown reasons the PR_NOWAIT
grower blocks forever when it tries to reaquire the pool lock to
do pool_prime_page() in pool_grow().

Meanwhile another process, waiting for the grower to finish, is
spinning forever at 100% doing the mutex_exit/mutex_enter/ERESTART
thing on the same pool. It looks to me like the grower never actually
gets scheduled to run.

Also, although it doesn't fix the issue, this pr_flags modification
looks like it should be moved to after the mutex is acquired:

pp->pr_flags &= ~(PR_GROWING|PR_GROWINGNOWAIT);
mutex_enter(&pp->pr_lock);

Kind regards,
-Tobias