Hi folks,

I am following up on a posting from January describing a problem
which led to frequent hangs on openacs.org.

Symptoms:
 - one connection thread after the other went into a
   busy loop when the server temporarily ran out
   of connections.
 - once all connection threads were looping,
   load == #connection threads, and the server
   did not accept any more HTTP connects.
 - scheduled jobs kept running fine.

The machine has relatively little RAM (only 5 connection
threads are configured) and runs PostgreSQL as well, so
increasing the number of connection threads significantly
is not a reasonable option there.

A couple of search engines enjoy the rich information
base (Google, Yahoo), and some people seemed to
benchmark their IP connections by fetching thousands
of copies of certain files. This led to the mentioned
problem: very sluggish behavior once a few connection
threads were looping, and a few lock-ups every day.

An upgrade from 4.0.10 to 4.5 did not help (same symptoms).

The problem went away when I installed a modified
version of the patch originally provided by Jeff Rogers
(adapted for 4.5, see below).

I would recommend adding this or a similar fix to the
AOLserver 4.5 code base, since the reported problem
persists in the new driver of 4.5 as well.

best regards
-gustaf neumann

--- nsd/driver.c-orig   2006-12-12 09:06:18.000000000 -0800
+++ nsd/driver.c        2006-12-12 09:16:52.000000000 -0800
@@ -947,6 +947,9 @@
    Ns_CondBroadcast(&drvPtr->cond);
    Ns_MutexUnlock(&drvPtr->lock);
+ /* register a ready proc to trigger the poll */
+    Ns_RegisterAtReady(TriggerDriver,drvPtr);
+
    /*
     * Loop forever until signalled to shutdown and all
     * connections are complete and gracefully closed.





Jeff Rogers wrote:
I found a bug in aolserver 4.0.10 (and previous 4.x versions, not sure about
earlier) that causes the server to lock up. I'm fairly certain I understand
the cause, and my fix appears to work although I'm not sure it is the best
approach.

The bug: when benchmarking the server with a program like ab with
concurrency=1 (that is, it issues a single request, waits for it to
complete, then immediately issues the next one) the server will lock up,
consuming no cpu, but not responding to any requests.

My explanation: when the max number of threads is hit and a new
connection is queued (NsQueueConn), no free connection can be found in
the pool, so the queueing fails and the new connection is added to the
wait list (waitPtr).  If there is a wait list then no drivers are
polled for new connections (driver.c:801); instead the driver waits to
be triggered (SockTrigger) as the signal that a thread is available to
handle the connection.  The triggering is done when the connection is
completed, within NsSockClose.  NsSockClose in turn is called somewhere
within the running of the connection (ConnRun - queue.c:617).  However,
the available thread is not put back onto the queue's free list until
after ConnRun has completed (queue.c:638).  So if the driver thread
runs in the time slice after ConnRun has completed for all active
connections but before they are added back to the free list, it
attempts to queue the connection, fails, adds it to the wait list, and
then waits for a trigger which will never come, and everything stops.
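
To make the interleaving concrete, here is a rough standalone model of
the lost wakeup.  The names (free_conns, triggered, worker, driver) are
made up, and a condition variable stands in for the trigger pipe, so
this is only a sketch of the ordering, not AOLserver code.  Depending
on scheduling it either finishes or hangs the same way the server does:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  trigger = PTHREAD_COND_INITIALIZER; /* "the trigger pipe" */
static int free_conns = 0;     /* "the conn free list in queue.c" */
static int triggered  = 0;     /* "a byte is sitting in the pipe" */

static void *
worker(void *arg)              /* plays the conn thread */
{
    (void) arg;

    /* ConnRun finishes; NsSockClose fires the trigger ... */
    pthread_mutex_lock(&lock);
    triggered = 1;
    pthread_cond_signal(&trigger);
    pthread_mutex_unlock(&lock);

    /* ... race window: the driver can run here and find no free conn ... */

    /* ... and only now is the conn put back on the free list. */
    pthread_mutex_lock(&lock);
    free_conns = 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *
driver(void *arg)              /* plays the driver thread */
{
    (void) arg;

    pthread_mutex_lock(&lock);
    while (!free_conns) {
        /* queueing failed: park the request on the wait list,
         * then sleep until the next trigger */
        while (!triggered) {
            pthread_cond_wait(&trigger, &lock);
        }
        triggered = 0;         /* trigger consumed, but the conn may not be free yet */
    }
    pthread_mutex_unlock(&lock);
    printf("driver: queued the waiting conn\n");
    return NULL;
}

int
main(void)
{
    pthread_t w, d;

    pthread_create(&d, NULL, driver, NULL);
    pthread_create(&w, NULL, worker, NULL);
    pthread_join(w, NULL);
    pthread_join(d, NULL);     /* hangs whenever the bad interleaving hits */
    return 0;
}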

The problem is a race condition, and as such is extremely timing
sensitive; I cannot reproduce the problem on a generic setup, but when
I'm benchmarking my OpenACS setup it hits the bug very quickly and
reliably.  The explanation suggests, and my testing confirms, that it
occurs much less reliably with concurrency > 1 or if there is a small
delay between sending the connections.  Together these mean that the
lockup is most likely to show up in exactly my test case, while being
much less likely on a production server or under high-concurrency load
testing.

My solution is to register SockTrigger as a ready proc; ready procs are
run immediately after the freed conns are put back onto the free queue
(queue.c:645).  This fixes the problem by ensuring that the trigger
pipe is notified strictly after the free queue is updated, so the
waiting conn will successfully be queued.  However, I'm not sure this
is best: NsSockClose attempts to minimize the number of times
SockTrigger is called when multiple connections are being closed at the
same time; my fix means it is called exactly once for each connection,
or twice counting the call in NsSockClose.  It's not clear to me what
adverse impact this has, if any, but one thing that could be done is to
remove the SockTrigger calls from NsSockClose as redundant.  Some
additional logic could be added to SockTrigger to not send to the
trigger pipe under certain conditions (e.g., if it has been triggered
and not acknowledged yet, or if there is no waiting connection), but
that would require mutex protection which could ultimately be more
expensive than just blindly triggering the pipe.
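
For what it's worth, that "skip redundant triggers" variant could look
roughly like the sketch below.  The names triggerPending, triggerLock
and CondTrigger are made up for illustration (they are not existing
driver.c code), and SockTrigger is assumed to be callable as in 4.0.x:

/* Hypothetical guard around the trigger pipe, not existing driver.c code. */
static int      triggerPending = 0;
static Ns_Mutex triggerLock;

static void
CondTrigger(void)
{
    int doSend;

    Ns_MutexLock(&triggerLock);
    doSend = !triggerPending;
    triggerPending = 1;
    Ns_MutexUnlock(&triggerLock);
    if (doSend) {
        SockTrigger();      /* one byte down the trigger pipe */
    }
}

The driver thread would reset triggerPending under the same lock right
after draining the pipe, before re-entering poll(); whether that extra
locking costs more than triggering unconditionally is the open
question.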

Here's a context diff for my patch:
*** driver.c.orig       Thu Jan 12 11:39:05 2006
--- driver.c    Thu Jan 12 11:39:10 2006
***************
*** 773,778 ****
--- 773,781 ----
        drvPtr = nextDrvPtr;
      }
+ /* register a ready proc to trigger the poll */
+     Ns_RegisterAtReady(SockTrigger,NULL);
+
      /*
       * Loop forever until signalled to shutdown and all
       * connections are complete and gracefully closed.


-J

