Re: [AOLSERVER] aolserver bug
On 2006.12.14, Gustaf Neumann [EMAIL PROTECTED] wrote:
> I am following up a posting from January, describing a problem which
> led to frequent hangs on openacs.org. [...]
> I would recommend adding this or a similar fix to the code base of
> AOLserver 4.5, since the reported problem persists in the new
> driver of 4.5 as well.

I've gone and accepted the patch for both HEAD and aolserver_v40_bp.

Refer to SF BUG #1615787: http://aolserver.com/sf/bug/1615787

Thanks!

-- Dossy

--
Dossy Shiobara | [EMAIL PROTECTED] | http://dossy.org/
Panoptic Computer Network | http://panoptic.com/
"He realized the fastest way to change is to laugh at your own folly -- then you can let go and quickly move on." (p. 70)

--
AOLserver - http://www.aolserver.com/
To remove yourself from this list, simply send an email to [EMAIL PROTECTED] with the body of "SIGNOFF AOLSERVER" in the email message. You can leave the Subject: field of your email blank.
Re: [AOLSERVER] aolserver bug
Saw. Cool. Thanks!

On 12/15/06, Dossy Shiobara [EMAIL PROTECTED] wrote:
> On 2006.12.14, Gustaf Neumann [EMAIL PROTECTED] wrote:
>> I am following up a posting from January, describing a problem which
>> led to frequent hangs on openacs.org. [...]
>> I would recommend adding this or a similar fix to the code base of
>> AOLserver 4.5, since the reported problem persists in the new
>> driver of 4.5 as well.
>
> I've gone and accepted the patch for both HEAD and aolserver_v40_bp.
>
> Refer to SF BUG #1615787: http://aolserver.com/sf/bug/1615787
>
> Thanks!
>
> -- Dossy

--
Nathan Folkman [EMAIL PROTECTED]
Re: [AOLSERVER] aolserver bug
Hi folks,

I am following up a posting from January, describing a problem which led to frequent hangs on openacs.org.

Symptoms:
- One connection thread after another went into a busy loop when the server temporarily ran out of connections.
- When all connection threads are looping (load == number of connection threads), the server did not accept any more HTTP connections.
- Scheduled jobs kept running fine.

The machine has relatively little RAM (only 5 connection threads configured) and is running Postgres as well. Increasing the number of connection threads significantly is not a reasonable option there. A couple of search engines (Google, Yahoo) enjoy the rich information base, and some people seemed to benchmark their IP connections by fetching thousands of copies of certain files. This led to the problem described above: very sluggish behavior when a few connection threads are already looping, and a few lock-ups every day.

An upgrade from 4.0.10 to 4.5 did not help (same symptoms). The problem went away when I installed a modified version of the patch originally provided by Jeff Rogers (modified for 4.5, see below).

I would recommend adding this or a similar fix to the code base of AOLserver 4.5, since the reported problem persists in the new driver of 4.5 as well.

best regards
-gustaf neumann

--- nsd/driver.c-orig 2006-12-12 09:06:18.0 -0800
+++ nsd/driver.c 2006-12-12 09:16:52.0 -0800
@@ -947,6 +947,9 @@
     Ns_CondBroadcast(&drvPtr->cond);
     Ns_MutexUnlock(&drvPtr->lock);
 
+    /* register a ready proc to trigger the poll */
+    Ns_RegisterAtReady(TriggerDriver,drvPtr);
+
     /*
      * Loop forever until signalled to shutdown and all
      * connections are complete and gracefully closed.

Jeff Rogers wrote:

I found a bug in aolserver 4.0.10 (and previous 4.x versions, not sure about earlier) that causes the server to lock up. I'm fairly certain I understand the cause, and my fix appears to work, although I'm not sure it is the best approach.
The bug: when benchmarking the server with a program like ab with concurrency=1 (that is, it issues a single request, waits for it to complete, then immediately issues the next one), the server will lock up, consuming no CPU but not responding to any requests.

My explanation: when the maximum number of threads is hit and a new connection is queued (NsQueueConn), it will be unable to find a free connection in the pool, the queueing fails, and the new connection is added to the wait list (waitPtr). If there is a wait list, then no drivers are polled for new connections (driver.c:801); instead the driver thread waits to be triggered (SockTrigger) to indicate that a thread is available to handle the connection. The triggering is done when the connection is completed, within NsSockClose. NsSockClose in turn is going to be called somewhere within the running of the connection (ConnRun - queue.c:617). However, the available thread is not put back onto the queue free list until after ConnRun has completed (queue.c:638). So if the driver thread runs in the time slice after ConnRun has completed for all active connections but before they are added back to the free list, it attempts to queue the connection, fails, adds it to the wait list, and then waits for a trigger which will never come, and everything stops.

The problem is a race condition, and as such is extremely timing sensitive; I cannot reproduce the problem on a generic setup, but when I'm benchmarking my OpenACS setup it hits the bug very quickly and reliably. The explanation suggests, and my testing confirms, that it occurs much less reliably with concurrency > 1 or if there is a small delay between sending the connections. Together these mean that the lockup is most likely to show up in exactly my test case, and much less likely on a production server or with high-concurrency load testing.
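The interleaving described above can be walked through in a small, single-threaded C model. This is only a sketch under stated assumptions: the names (Model, driver_try_queue, driver_wake, run_scenario) are hypothetical and are not AOLserver's API, the model tracks just three counters (free list, wait list, pending trigger), and it replays the steps in a fixed order rather than using real threads. The trigger_after_free flag is an assumed contrast case in which a wakeup is also sent once the connection is back on the free list.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the lost wakeup; all names are hypothetical, not AOLserver's. */
typedef struct {
    int free_conns;  /* connection structures on the free list */
    int waiting;     /* accepted connections parked on the wait list */
    int triggers;    /* unread wakeups sitting in the trigger pipe */
} Model;

/* Driver tries to queue a new connection; parks it if nothing is free. */
static bool driver_try_queue(Model *m)
{
    if (m->free_conns > 0) {
        m->free_conns--;
        return true;
    }
    m->waiting++;
    return false;
}

/* Driver wakes only if a trigger is pending, then retries the wait list.
 * Returns false when it would stay blocked waiting for a trigger. */
static bool driver_wake(Model *m)
{
    if (m->triggers == 0) {
        return false;
    }
    m->triggers = 0;
    while (m->waiting > 0 && m->free_conns > 0) {
        m->waiting--;
        m->free_conns--;
    }
    return true;
}

/* Replays the interleaving with one connection thread.
 * Returns true if the parked connection ends up queued. */
bool run_scenario(bool trigger_after_free)
{
    Model m = {0, 0, 0};  /* the only connection is busy: free list empty */
    bool queued;

    m.triggers++;                  /* 1. request closes its socket: trigger */
    driver_wake(&m);               /* 2. driver consumes the trigger ... */
    queued = driver_try_queue(&m); /*    ... and tries to queue a new request */
    assert(!queued);               /*    free list is still empty here */

    m.free_conns++;                /* 3. connection returns to the free list */
    if (trigger_after_free) {
        m.triggers++;              /*    contrast case: wake the driver again */
    }

    driver_wake(&m);               /* 4. driver waits on the trigger pipe */
    return m.waiting == 0;
}
```

Calling run_scenario(false) leaves one connection parked on the wait list with no trigger pending, which matches the reported hang; run_scenario(true) lets the driver retry and queue it.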
My solution is to register SockTrigger as a ready proc; ready procs are run immediately after the freed conns are put back onto the free queue (queue.c:645). This fixes the problem by ensuring that the trigger pipe is notified strictly after the free queue is updated, so the waiting conn will successfully be queued.

However, I'm not sure this is best: NsSockClose attempts to minimize the number of times SockTrigger is called when multiple connections are being closed at the same time; my fix means it is called exactly once for each connection, or twice counting the call in NsSockClose. It's not clear to me what adverse impact this has, if any, but one thing that could be done is to remove the SockTrigger calls from NsSockClose as redundant. Some additional logic could be added to SockTrigger so that it does not send to the trigger pipe under certain conditions (i.e., if it has been triggered and not acknowledged yet, or if there is no waiting connection), but that would require mutex protection, which could ultimately be more expensive than just blindly triggering the pipe.

Here's a
Re: [AOLSERVER] aolserver bug
> I would recommend adding this or a similar fix to the code base of
> AOLserver 4.5, since the reported problem persists in the new
> driver of 4.5 as well.

Let me second Gustaf on this. Our site (openacs.org) was locking up every few hours for some weeks before we applied this patch. After applying it, we've gone two full days without a lock-up.

This bug is real, and this or another fix really needs to get into the code.
Re: [AOLSERVER] aolserver bug
I'll take a look at this over the next couple of weeks when I'm on vacation and see if I can get it applied. Thanks for finding and fixing! :-)

On 12/14/06, dhogaza@pacifier.com wrote:
>> I would recommend adding this or a similar fix to the code base of
>> AOLserver 4.5, since the reported problem persists in the new
>> driver of 4.5 as well.
>
> Let me second Gustaf on this. Our site (openacs.org) was locking up
> every few hours for some weeks before we applied this patch. After
> applying it, we've gone two full days without a lock-up.
>
> This bug is real, and this or another fix really needs to get into
> the code.

--
Nathan Folkman [EMAIL PROTECTED]
Re: [AOLSERVER] aolserver bug
You could try my patch; I'm pretty sure it won't have any ill effects (perhaps marginally more CPU use due to a few extra passes through the driver loop), and that's hardly any worse than your occasional lockups.

Another thing to look at is whether you can get a core dump (kill -QUIT) or attach a debugger. In my case the driver thread was sitting in poll() and all the connection threads were waiting inside pthread_cond_wait/Ns_CondTimedWait (so of course my first suspicion was a mutex deadlock, but that doesn't seem possible with the structure of the code).

-J
Re: [AOLSERVER] aolserver bug
On Jan 12, 2006, at 11:34 AM, Jeff Rogers wrote:
> I found a bug in aolserver 4.0.10 (and previous 4.x versions, not
> sure about earlier) that causes the server to lock up. I'm fairly
> certain I understand the cause, and my fix appears to work, although
> I'm not sure it is the best approach.

FWIW, we see this happening on all our busy sites. Whether the cause is the same I don't know, but the symptoms are identical. We do nightly restarts, which helps reduce the incidence, but it is still not uncommon to have sites get restarted because they were non-responsive and yet the system load was nearly 0.

janine