Re: No processes left after big AB test

Paul J. Reder Tue, 10 Apr 2001 07:08:52 -0700
On Mon, 9 Apr 2001, Bill Stoddard wrote:
> What do you have maxrequestsperchild set to? If it is not zero, then I suspect
> the problem is that the main thread in the threaded child process can sometimes
> exit before the worker threads.  If you take a look at threaded.c, you will see
> that the main child thread is turned into a worker thread. If it exits first
> (due to max requests per child processing), the other threads in the process go
> off into never-never land. Paul Reder has a fix to this problem but has not
> posted it for some reason (perhaps because the fix exposes another serious bug
> in the threaded MPM in process_idle_server_maintenance?)
>
> Paul?

(Paul steps to the microphone) Uh, thank you for inviting me here today.

The reason I didn't post my patch is that after further testing, all it does is
delay the problem. Note that all of my work is done on Red Hat Linux 6.1.

I was previously able to kill Apache under heavy load in about 30-60 seconds.
With my patch it took several hours but the end result was the same.

I end up with a boatload of threads whose owner pid is 1, none of which is serving
pages anymore (so Apache stops serving). I cannot attach a debugger to the threads
and I cannot strace them, so I have no idea why the threads didn't exit (max requests
is set to 1000).

There seem to be several layers of problems here.
     1) The child_main that spawns threads (running worker_thread) goes into
          worker_thread, then exits when it hits max requests, leaving orphaned
          threads. My patch for this was to replace the final call to
          worker_thread(my_info); in child_main with:
               while (worker_thread_count > 0)
                   sleep(3);
          This only delays the total death of Apache and has the undesirable side
          effect that there are periods of very low rps, for the following reasons:
          1) server_main_loop will only start X number of servers (with replacements).
          2) Each server calls child_main to start Y number of threads (without
               replacements).
          3) Once the threads start reaching max_requests, they start dying off.
          4) You regularly end up with servers with one or two threads that can't go
               away immediately because they are processing very large responses.
          5) In a worst-case scenario (which seems to happen more often than I would
               expect) you end up with X servers, each with 1 or 2 threads that are
               busy processing their last request (according to max requests) and
               are holding up the server going away. During this time, no new
               requests can be processed.

     2) Even after closing the above loophole, I am still ending up with large
          numbers of useless threads with owner of pid 1. Some of them are
          defunct, many are not. Occasionally I find that a defunct "process" has
          an owner pid of an existing non-defunct "process" whose owner pid is 1.
          This is the same behavior as before the above patch; it just takes
          longer to get to.

     3) During one of my test runs I seem to have lost the main Apache process,
          but no core file was generated. There were a bunch of error log entries
          indicating that apr_accept didn't have enough memory, but there were
          2000 logged lines after the last one of these.

     4) I believe, though at the moment cannot prove, that the same idle cleanup
          problem from prefork exists in the threaded MPM. These other problems
          need to be fixed before I can test to verify this.
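
For reference, the workaround in item 1 can be sketched in isolation. This is only
a sketch under assumptions, not the code in threaded.c: child_state, worker and
run_child are hypothetical names, the shared request budget merely stands in for
max requests, and a condition variable replaces the sleep(3) polling loop. The
point it illustrates is that the thread that started the workers stays alive until
the worker count reaches zero.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 64

/* Hypothetical per-child state; these are not the names used in threaded.c. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_done;   /* signalled when live_workers reaches 0 */
    int live_workers;           /* workers that have not yet finished */
    int requests_left;          /* shared budget, in the spirit of max requests */
    int requests_served;
} child_state;

static void *worker(void *arg)
{
    child_state *cs = arg;

    for (;;) {
        pthread_mutex_lock(&cs->lock);
        if (cs->requests_left == 0) {
            /* Budget exhausted: report the exit before leaving. */
            if (--cs->live_workers == 0)
                pthread_cond_signal(&cs->all_done);
            pthread_mutex_unlock(&cs->lock);
            return NULL;
        }
        cs->requests_left--;
        cs->requests_served++;
        pthread_mutex_unlock(&cs->lock);
        /* A real worker would accept and serve a connection here. */
    }
}

/* Start the workers, then keep the calling ("main") thread alive until
 * every worker has finished. Returns the number of requests served. */
int run_child(int num_workers, int max_requests)
{
    child_state cs;
    pthread_t tids[MAX_WORKERS];
    int started = 0, i, served;

    assert(num_workers > 0 && num_workers <= MAX_WORKERS && max_requests >= 0);

    pthread_mutex_init(&cs.lock, NULL);
    pthread_cond_init(&cs.all_done, NULL);
    cs.live_workers = num_workers;
    cs.requests_left = max_requests;
    cs.requests_served = 0;

    for (i = 0; i < num_workers; i++) {
        if (pthread_create(&tids[started], NULL, worker, &cs) != 0)
            break;
        started++;
    }

    pthread_mutex_lock(&cs.lock);
    cs.live_workers -= num_workers - started;   /* threads that never started */
    while (cs.live_workers > 0)                 /* instead of sleep(3) polling */
        pthread_cond_wait(&cs.all_done, &cs.lock);
    served = cs.requests_served;
    pthread_mutex_unlock(&cs.lock);

    /* Joining reclaims the threads before the state on this stack goes away. */
    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_cond_destroy(&cs.all_done);
    pthread_mutex_destroy(&cs.lock);
    return served;
}
```

Calling run_child(4, 1000) starts four workers that share a budget of 1000
requests and returns only after the last of them has exited. Like the patch, this
does nothing about the low-throughput window while the last few workers drain.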

I am currently putting debugging "printf" code into the threaded MPM to log the pids
of new threads and their owning pid so that I can get a better idea of which processes
are being left around and what the ownership hierarchy is supposed to be. I will also
add logging at the important phases in each worker_thread's life.
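
A minimal sketch of that kind of instrumentation follows. The helper names
(format_thread_trace, log_thread_phase) and the log format are assumptions made
for illustration and are not part of the MPM. It relies on the LinuxThreads
behavior where each thread shows up as its own process, so getpid() and getppid()
differ per thread; on thread libraries where all threads share one pid, every
line would report the same values.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: format one trace line for a thread lifecycle phase.
 * Returns the snprintf result (the length the full line needs). */
int format_thread_trace(char *buf, size_t len, const char *phase)
{
    assert(buf != NULL && phase != NULL);
    return snprintf(buf, len, "[trace] pid=%ld ppid=%ld phase=%s",
                    (long)getpid(), (long)getppid(), phase);
}

/* Hypothetical helper: write the trace line to stderr, which Apache
 * normally redirects to the error log. */
void log_thread_phase(const char *phase)
{
    char line[128];

    format_thread_trace(line, sizeof line, phase);
    fprintf(stderr, "%s\n", line);
}
```

Calls such as log_thread_phase("start") at the top of a worker and
log_thread_phase("exit") just before it returns would give the ownership chain
for each thread, which can then be compared against the ps output.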

[EMAIL PROTECTED] wrote:
> 
> A few of us spoke about this bug at ApacheCon.  Our idea was to change
> APR, so that APR provides a function to make a thread wait for signals.  If
> the program wants to do that in a separate thread, then it can just create
> a thread and make that function the starting function.  Otherwise, the
> main thread just becomes the signal thread.  The patch should be
> relatively easy to create, because it is more ripping out code than
> writing new.
> 
> Ryan
> 

I'm not seeing how the signal processing is impacting this at the moment. The main
Apache process handles the signal processing. The threads all use the pipe of death.
The problem has to do with the child_main thread owning the worker_threads (as far
as I can tell).
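
For readers unfamiliar with the term, the general pipe-of-death idea can be
sketched as below. This is a simplified, single-process illustration and not the
MPM's implementation: pod, pod_worker and shut_down_workers are hypothetical
names, and sending one byte per worker is an assumption made to keep the example
small. The actual bookkeeping in the threaded MPM may well differ.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>

#define NUM_WORKERS 4

static int pod[2];   /* pod[0]: read end watched by workers; pod[1]: write end */
static int exited;   /* workers that have left their loop */
static pthread_mutex_t exited_lock = PTHREAD_MUTEX_INITIALIZER;

static void *pod_worker(void *unused)
{
    char c;
    ssize_t n;

    (void)unused;
    /* A real worker would poll its listening socket together with the pipe;
     * this one only blocks on the pipe. */
    do {
        n = read(pod[0], &c, 1);
    } while (n < 0 && errno == EINTR);

    /* One byte (or end-of-file) means "time to go". */
    pthread_mutex_lock(&exited_lock);
    exited++;
    pthread_mutex_unlock(&exited_lock);
    return NULL;
}

/* Start the workers, then ask them to exit through the pipe.
 * Returns the number of workers that exited, or -1 if pipe() failed. */
int shut_down_workers(void)
{
    pthread_t tids[NUM_WORKERS];
    int started = 0, i;
    char c = '!';

    if (pipe(pod) != 0)
        return -1;
    exited = 0;

    for (i = 0; i < NUM_WORKERS; i++)
        if (pthread_create(&tids[started], NULL, pod_worker, NULL) == 0)
            started++;

    /* Each one-byte read consumes exactly one byte, so each byte stops
     * one worker. */
    for (i = 0; i < started; i++)
        if (write(pod[1], &c, 1) != 1)
            break;
    close(pod[1]);   /* any worker still waiting sees end-of-file */

    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    close(pod[0]);

    assert(exited == started);
    return exited;
}
```

Nothing in this mechanism involves signals, which is why the signal handling does
not look like the culprit here.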

By the way, when I start Apache and then run ps -efH (with no server load) I get
something like:

webadmin 21803     1  0 10:07 pts/3    00:00:00   httpd -d /home/webadmin/Apache   (1 top level Apache)
webadmin 21805 21803  0 10:07 pts/3    00:00:00     httpd -d /home/webadmin/Apac   (Start_Server number of these)
webadmin 21808 21805  0 10:07 pts/3    00:00:00       httpd -d /home/webadmin/Ap   (1 per Start_server)
webadmin 21809 21808  0 10:07 pts/3    00:00:00         httpd -d /home/webadmin/   (threads_per_Child number of these)
webadmin 21812 21808  0 10:07 pts/3    00:00:00         httpd -d /home/webadmin/       -
webadmin 21815 21808  0 10:07 pts/3    00:00:00         httpd -d /home/webadmin/       -
webadmin 21818 21808  0 10:07 pts/3    00:00:00         httpd -d /home/webadmin/       -

I understand 21803 and I understand 21809 and its ilk. I also understand either 21805
or 21808, but not both. What am I missing in the way that processes and threads are
handled in APR/threaded MPM?

-- 
Paul J. Reder
-----------------------------------------------------------
"The strength of the Constitution lies entirely in the determination of each
citizen to defend it.  Only if every single citizen feels duty bound to do
his share in this defense are the constitutional rights secure."
-- Albert Einstein