Hi --

   This may be an old topic of conversation, in which case I apologize.
I Googled and searched marc.theaimsgroup.com and Apache Bugzilla but
didn't see anything, so here I am with a question.

   In brief, on Linux, when doing an ungraceful stop of httpd, any
worker threads that are poll()ing on Keep-Alive connections don't get
awoken by close_worker_sockets() and that can lead to the process
getting the SIGKILL signal without ever getting the chance to run
apr_pool_destroy(pchild) in clean_child_exit().  This seems to
relate to this particular choice by the Linux and/or glibc folks:

http://bugme.osdl.org/show_bug.cgi?id=546


   The backstory goes like this: I spent a chunk of last week trying
to figure out why my module wasn't shutting down properly.  First I
found some places in my code where I'd failed to anticipate the order
in which memory pool cleanup functions would be called, especially
those registered by apr_thread_cond_create().

   However, after fixing that, I found that when connections were still
in the 15 second timeout for Keep-Alives, a child process could get the
SIGKILL before it finished cleaning up.  (I'm using httpd 2.2.0 with the
worker MPM on Linux 2.6.9 [RHEL 4] with APR 1.2.2.)  The worker threads
are poll()ing and, if I'm reading my strace files correctly, they don't
get an EBADF until after the timeout completes.  That means that
join_workers() is waiting for those threads to exit, so child_main()
can't finish up and call clean_child_exit() and thus apr_pool_destroy()
on the pchild memory pool.
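
   For anyone who wants to compare platforms, here is a minimal probe
of the behaviour described above.  It is only a sketch, not httpd code:
the names poll_probe, poll_keepalive and close_wakes_poll are made up
for illustration, a socketpair stands in for a Keep-Alive connection,
and the 500 ms timeout stands in for the Keep-Alive timeout.  One
thread sits in poll() while another close()s the descriptor, and the
function reports whether poll() woke early or ran to its timeout.

```c
#include <poll.h>
#include <pthread.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical probe state: the polled fd and what poll() returned. */
struct poll_probe {
    int fd;
    int rc;
};

/* Stand-in for a worker thread waiting on an idle Keep-Alive socket. */
static void *poll_keepalive(void *arg)
{
    struct poll_probe *p = arg;
    struct pollfd pfd = { .fd = p->fd, .events = POLLIN };

    p->rc = poll(&pfd, 1, 500);   /* 500 ms stands in for KeepAliveTimeout */
    return NULL;
}

/*
 * Returns 1 if poll() came back before its timeout after another thread
 * closed the fd, 0 if it ran to the timeout, -1 on setup failure.
 */
int close_wakes_poll(void)
{
    int sv[2];
    pthread_t tid;
    struct poll_probe p = { -1, 0 };
    struct timespec pause = { 0, 100 * 1000 * 1000 };   /* 100 ms */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    p.fd = sv[0];
    if (pthread_create(&tid, NULL, poll_keepalive, &p) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    nanosleep(&pause, NULL);   /* give the thread time to enter poll() */
    close(sv[0]);              /* roughly what close_worker_sockets() does */

    pthread_join(tid, NULL);
    close(sv[1]);
    return p.rc == 0 ? 0 : 1;
}
```

   A result of 0 would match what my strace files suggest on Linux;
a result of 1 would mean the close() woke the poller.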

   This is a bit of a problem for me because I really need
join_workers() to finish up and the cleanups I've registered
against pchild in my module's child_init handler to be run if
at all possible.

   It was while researching all this that I stumbled on the amazing
new graceful-stop feature and submitted #38621, which I see has
already been merged ... thank you!

   However, if I need to do an ungraceful stop of the server --
either manually or because the GracefulShutdownTimeout has
expired without a chance to gracefully stop -- I'd still like my
cleanups to run.


   My solution at the moment is a pure hack -- I threw in
apr_sleep(apr_time_from_sec(15)) right before
ap_reclaim_child_processes(1) in ap_mpm_run() in worker.c.
That way it lets all the Keep-Alive timeouts expire before
applying the SIGTERM/SIGKILL hammer.  But that doesn't seem
ideal, and moreover, doesn't take into account the fact that
KeepAliveTimeout values > 15 seconds may have been configured.  Even
if I expand my hack to wait for the maximum possible Keep-Alive
timeout, it's still clearly a hack.
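
   To make the "maximum possible Keep-Alive timeout" variant concrete,
here is a rough sketch of the calculation.  The struct vhost_conf type
and the max_keepalive_wait_sec() function are hypothetical stand-ins,
not httpd structures; in the real server the values would have to come
from the per-virtual-host configuration.

```c
#include <stddef.h>

/* Hypothetical stand-in for one virtual host's configuration. */
struct vhost_conf {
    long keep_alive_timeout_sec;       /* configured KeepAliveTimeout */
    const struct vhost_conf *next;     /* next virtual host, or NULL */
};

/* Longest Keep-Alive timeout across all hosts; 0 for an empty list. */
long max_keepalive_wait_sec(const struct vhost_conf *head)
{
    long max = 0;

    for (const struct vhost_conf *v = head; v != NULL; v = v->next) {
        if (v->keep_alive_timeout_sec > max)
            max = v->keep_alive_timeout_sec;
    }
    return max;
}
```

   The result would replace the hard-coded 15 in the apr_sleep() call,
which only makes the hack more thorough, not less of a hack.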


   Does anyone have any advice?  Does this seem like a problem
to be addressed?  I tried to think through how one could signal
the poll()ing worker threads with pthread_kill(), but it seems
to me that not only would you have to have a signal handler
in the worker threads (not hard), you'd somehow have to break
out of whatever APR wrappers are abstracting the poll() once
the handler set its flag or whatever and returned -- the APR
functions can't just loop on EINTR anymore.  (Is it
socket_bucket_read() in the socket bucket code and then
apr_socket_recv()?  I can't quite tell yet.)  Anyway, it seemed
complex and likely to break the abstraction across OSes.

   Still, I imagine I'm not the only one who would really like
those worker threads to cleanly exit so everything else does ...
after all, they're not doing anything critical, just waiting
for the Keep-Alive timeout to expire, after which they notice
their socket is borked and exit.

   FWIW, I tested httpd 2.2.0 with the worker MPM on a Solaris
2.9 box and it does indeed do what the Linux "bug" report says;
poll() returns immediately if another thread closes the socket
and thus the whole httpd server exits right away.

   Thoughts, advice?  Any comments appreciated.

Chris.

-- 
GPG Key ID: 366A375B
GPG Key Fingerprint: 485E 5041 17E1 E2BB C263  E4DE C8E3 FA36 366A 375B
