Hey,

(Note: none of my changes have been committed, they are not complete or 
integrated with Gustaf's code yet.)

I also looked more at this. The idea of triggering the accounting process at 
the end of a thread's life seems very good. One real problem with the current 
setup is that there is no independent thread which can be triggered to do a 
pool maintenance cycle. Until now this has been characterized by several bad 
behaviors which lead to poor performance and too many threads exiting on 
timeouts. If a timeout is configured, any lull in activity will expire all threads. 

I haven't tried Gustaf's patch yet because I have been working on the 
accounting portion of the problem. Here are some of the things I have tried, 
and the apparent results:

1. Added a new counter to the Pool structure to signal the number of threads 
to create (poolPtr->threads.create).
2. A connection thread increments this under certain conditions:
  a.     if (ncons == 1 && poolPtr->threads.maxconns > 0) {
             /* This thread is servicing its last allowed connection. */
             ++poolPtr->threads.create;
         }
        However, this sometimes leads to slightly more threads than maxthreads.
        Usually, the number of threads exactly equals minthreads due to other
        changes; maxthreads is only exceeded when maxconns is set to a low
        number.
  b.  Thread timeout is moved inside the wait loop, and recalculated on each 
      cycle:

        while (!poolPtr->shutdown
               && status == NS_OK
               && poolPtr->queue.wait.firstPtr == NULL) {
            /*
             * Wait for a connection to arrive, exiting if one doesn't
             * arrive in the configured timeout period.
             */
            if (poolPtr->threads.current - 1 <= poolPtr->threads.min) {
                /*
                 * Exiting on a timeout would drop us to or below minthreads,
                 * so wait forever instead.
                 */
                timePtr = NULL;
            } else {
                Ns_GetTime(&wait);
                Ns_IncrTime(&wait, poolPtr->threads.timeout, 0);
                timePtr = &wait;
            }
            
            status = Ns_CondTimedWait(&poolPtr->cond, &poolPtr->lock, timePtr);
        }

     Also, the conditions for infinite wait are slightly changed.

 c.  If a thread times out, a check is made whether exiting would push the
     thread count below minthreads, and if so an additional create is added.
     It seems possible that a single thread could signal the creation of two
     threads during its lifetime, but this may just be making up for threads
     which exit without signalling a new thread. This is probably a mistake.
     What I have observed is that it doesn't prevent threads from exiting
     below minthreads if there is a lull in activity, but create still
     accounts for them, and if a single request comes in, they are all
     created. Then they may time out again. 

     Probably timed-out threads should not be replaced, but there is not yet a
     way to keep a timed-out thread from exiting when that would push the
     count below minthreads. So this will be removed, but for now it gives a
     little probe into the state of the connection threads and the driver
     thread.
        if (poolPtr->queue.wait.firstPtr == NULL) {
            msg = "timeout waiting for connection";
            if (poolPtr->threads.current - 1 <= poolPtr->threads.min) {
                /* Exiting would drop below minthreads: note a replacement. */
                ++poolPtr->threads.create;
            }
            break;
        }
3. The number of queued requests was unknown before. poolPtr->threads.queued 
only increased. I added a line to reduce this number when requests are 
dequeued. This made it very easy to discover how things are working between 
the driver thread and the connection threads. 
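The change itself is just a decrement next to the existing dequeue accounting. 
Roughly (a sketch only; the exact spot is wherever the connection thread pulls 
the next conn off poolPtr->queue.wait, and the surrounding lines here are 
assumed, not copied from the source):

        connPtr = poolPtr->queue.wait.firstPtr;
        poolPtr->queue.wait.firstPtr = connPtr->nextPtr;
        --poolPtr->queue.wait.num;
        --poolPtr->threads.queued;   /* new: undo the increment made at queue time */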

For instance, if a heavy load of requests suddenly ends, there appear to be 
leftover queued requests. Using netstat, you can see these connections in 
CLOSE_WAIT. They appear to be requests which were abandoned (by Apache Bench), 
but the driver thread or connection threads are stuck, and they remain in this 
state until another request (even a single one) comes in. 

So another difference with this code appears to be that queued but abandoned 
requests are no longer being serviced. It may be that the driver thread gets 
triggered one more time and cleans up the mess. I still have to verify this, 
but it would be good if true. 

4. Gustaf identified another problem with the code: the thread creation proc 
includes a thread join. This makes it impossible to use inside a connection 
thread. There is actually no reason to do the thread join in the thread create 
proc; only the driver thread, or the pool exit code, can do this. So it should 
be removed from where it is and placed in the NsQueueConn proc, just before 
thread creation is done. 
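A rough sketch of what that would look like (the surrounding logic in 
NsQueueConn, including the create flag, is assumed here and abbreviated; the 
point is only where the join moves to):

        /*
         * In NsQueueConn, just before a new connection thread is created:
         * reap exited threads here instead of inside the create proc, so
         * that connection threads can also trigger thread creation safely.
         */
        if (create) {
            NsJoinConnThreads();
            NsCreateConnThread(poolPtr);
        }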

My code doesn't yet reflect this idea, but I have added another proc to create 
multiple threads. It does the accounting and then loops while 
poolPtr->threads.create-- > 0:

void
NsCreateConnThreads(Pool *poolPtr)
{
    /*
     * Reap any dead threads. NOTE: Move this to NsQueueConn
     */
    NsJoinConnThreads();

    /*
     * Create new connection threads.
     */
    Ns_MutexLock(&poolPtr->lock);

    if (poolPtr->threads.queued > 1) {
        /* Requests are backing up: create one extra thread. */
        poolPtr->threads.create += 1;
    }
    if (poolPtr->threads.current < poolPtr->threads.min) {
        /* Make up any deficit below minthreads. */
        poolPtr->threads.create += poolPtr->threads.min -
                                   poolPtr->threads.current;
    }
    while (poolPtr->threads.create-- > 0) {
        NsCreateConnThread(poolPtr);
        ++poolPtr->threads.idle;
        ++poolPtr->threads.current;
    }
    Ns_MutexUnlock(&poolPtr->lock);
    Ns_CondBroadcast(&poolPtr->cond);
}
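
Once the NsJoinConnThreads call is moved out as the NOTE above says, this proc 
is meant to be driven from the end of a connection thread's life. A sketch of 
the call site I have in mind (the exit accounting in NsConnThread is only 
summarized in the comment, not copied from the source):

        /*
         * At the very end of NsConnThread, after this thread's own exit
         * accounting (e.g. decrementing poolPtr->threads.current under
         * the pool lock), trigger the pool-wide accounting:
         */
        NsCreateConnThreads(poolPtr);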

It isn't obvious how to create more than minthreads. All that is known, as far 
as I can tell, is that if the number of queued requests is greater than 1, an 
additional thread should be created for the incoming request. A queue with 
more than one request waiting means that all threads are busy, and there is 
still no guarantee that all of them will not exit and leave the current 
request waiting. This may push the number of threads above maxthreads by 1.

Weirdly, it is easy to maintain minthreads under continuous load. The above 
code perfectly calculates the deficit when no threads are exiting due to a 
timeout. This is an important difference from the previous code, which was 
thread starved after thread exit: only new requests could boost the number 
back up, and only by one. 

Another addition is the final line: Ns_CondBroadcast. I haven't tried this 
inside the mutex yet; it is supposed to be okay outside, but the behavior 
isn't guaranteed to be consistent.
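If it turns out the broadcast does need to happen under the lock, the change 
is trivial (untested):

        Ns_MutexLock(&poolPtr->lock);
        /* ... accounting and thread creation as above ... */
        Ns_CondBroadcast(&poolPtr->cond);
        Ns_MutexUnlock(&poolPtr->lock);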

I still haven't considered how (or if) to integrate this with Gustaf's latest 
code. His code appears to minimize resources, taking care to replace any 
thread which exits when there is still work to be done. Due to my changes, 
poolPtr->queue.wait.num == poolPtr->threads.queued. Maybe I should have used 
queue.wait.num directly, but the fact that they are the same (or should be) is 
good to know. (poolPtr->threads.queued is printed by the ns_pools get command.)
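While testing, a cheap way to check that they really do stay in sync is an 
assertion taken while holding the pool lock (needs <assert.h>; purely a 
debugging aid):

        assert(poolPtr->threads.queued == poolPtr->queue.wait.num);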

My code has the potential to exceed maxthreads while a thread services its 
last request. In practice, this doesn't happen unless maxconns is set low and 
requests take a long time. Essentially this jumpstarts the thread creation by 
a little bit, but even Gustaf's code creates a thread prior to exit, so the 
difference may not be that large (or it might be!). It isn't clear to me yet 
whether a thread exiting or starting affects memory use. Considering that pool 
threads can be shared among different servers, interps must hang around with 
the bulk of the bytes even after thread exit. 

One thing I'm not too sure about (due completely to ignorance on my part) is 
this part of Gustaf's code:

  505     Ns_MutexUnlock(&poolPtr->lock);
  506 
  507     if (poolPtr->queue.wait.num > 0 && poolPtr->threads.idle == 0 && !poolPtr->shutdown) {
  508         /* We are exiting from a thread in a situation, where more
  509            queue entries are waiting. Since no other mechanism ensures
  510            that the entries are processed, we recreate a new connection thread.
  511         */
  512         Ns_MutexLock(&poolPtr->lock);
  513         poolPtr->threads.current++;
  514         poolPtr->threads.idle++;
  515         Ns_MutexUnlock(&poolPtr->lock);
  516         NsCreateConnThread(poolPtr, 0); /* joinThreads == 0 to avoid deadlock */
  517     }

The if statement on line 507 reads the Pool structure without holding the 
mutex, so there is no guarantee those conditions still hold by the time the 
code grabs the lock on line 512; they could change in between. 
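One way to close that window is to take the lock first and test the same 
conditions under it, something like this (keeping Gustaf's names; just a 
sketch):

        Ns_MutexLock(&poolPtr->lock);
        if (poolPtr->queue.wait.num > 0 && poolPtr->threads.idle == 0
                && !poolPtr->shutdown) {
            poolPtr->threads.current++;
            poolPtr->threads.idle++;
            Ns_MutexUnlock(&poolPtr->lock);
            NsCreateConnThread(poolPtr, 0);   /* joinThreads == 0, as before */
        } else {
            Ns_MutexUnlock(&poolPtr->lock);
        }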

tom jackson

On Saturday 20 October 2007 05:35, Gustaf Neumann wrote:
> Tom Jackson schrieb:
> > On Friday 19 October 2007 09:12, Gustaf Neumann wrote:
> >> Actually, if minthreads is set to a value > 0 (it was not set), then
> >> idle threads should care about queued requests. Maybe the
> >> easier fix is to set minthreads per default to 1. I will
> >> try this, when i am back home (in a couple of hours).
> >
> > AOLserver does not respect minthreads except at startup. This is part of
> > the same issue: Nothing except a request can create a thread. It seems
> > like on thread exit, a little accounting process could go on to bring
> > threads.current up to threads.min; this might require more than one
> > thread creation, maybe not.
>
> fully agreed.
>
> i continued a little on the issue and committed a patch
> to CVS, which respects maxconns always. Instead of performing,
> in boundary situations, more requests than the configured
> maxconns in the worst case, the code now creates
> a new connection thread automatically after the exit of a thread
> coming to the end of its work cycle, when jobs are pending
> and no other thread is able to process them.
>
> i made as well a small variant of this (not in CVS), which
> defines a new command NsCheckPools() in pools.c
> which iterates over pools and starts a thread in the
> same situation as above.
> NsCheckPools could be extended to check for the existence of
> minthreads etc. However, for this patch, i wanted to be
> as little invasive as necessary, and added the thread-revival
> code to the place where a thread exits.
>
> For the new thread generation, i had to parameterize
> NsCreateConnThread() to avoid a resource deadlock
> in Ns_ThreadJoin().
>
> > In your patch, you change the while loop test:
> > +   while (poolPtr->threads.maxconns <= 0
> > +   || ncons-- > 0
> > +   || poolPtr->queue.wait.num > 1) {
> >
> > Shouldn't the loop continue with poolPtr->queue.wait.num > 0 ?
>
> 0 or 1, both are fine here. The condition triggers for the new cases only
> when > 1 is used.
>
> > Your patch looks like a great fix...I just still don't understand why the
> > server would completely lock up. As long as you have requests coming in,
> > seems like a new thread would get created. I wonder if what is happening
> > is that Apache Bench simple stops sending new requests because none of
> > the requests are finishing. If it can block, I wonder if simply visiting
> > with a browser would kick things off again?
>
> one can use the browser to kick things off again, but if there are
> already, say, 50 requests in the queue, the browser hangs on the first
> request. The bad thing is that in the meantime, some other bulky requests
> are queued as well, so the queue might actually be growing permanently.
>
> for my testing, i use:
>
> ab -c 20 -n 35 http://localhost:8003/file-storage/view/dotlrn-2.3.1.tgz
>
> with maxthreads 5 and maxconns 3 I get reliable hangs, where
> the last thread exits with about 5 to 15 queued requests. Without the
> recent patches, the queue is processed until it is empty.
>
> btw, the reason the patch helps is no miracle. It is completely
> clear why the server hangs. This bug is not a classical deadlock (it
> is possible that the queued requests are processed), but shares some
> properties of a livelock (some requests prevent the processing
> of other requests, at least for a while). It is not a race condition
> either. I don't believe that TriggerDriver() is the right place to
> handle the problem, since the driver is happily accepting requests
> in the bug situation. It can certainly be that the fixed bug is different
> from the bug that Jeff Rogers fixed originally.
>
> -gustaf neumann
>
> PS: i am pretty sure that this is the same bug as on openacs.org.
> Around the time of the bug, somebody in spain was apparently
> benchmarking his internet connection, downloading dotlrn*.tar*
> in multiple parallel sessions. Once i saw that the user
> was trying to download 1200 copies of dotlrn.tar in about 10 minutes.

