After a couple days of research (and talking with Ed Korthof), I think
the mysterious bug is solved.  Recall the reproduction recipe:

  * start a long running 'svn import'
  * run 'apachectl graceful'
  * a few seconds later, httpd just hangs.

It looks like the main bug is with the 'rotatelogs' process that comes
with apache.  It has nothing to do with SSL at all -- it's
reproducible over HTTP, and even without subversion.

The theory (as I understand it) is this: if you set up your httpd.conf
appropriately, the httpd parent process launches a 'rotatelogs' child
process, along with N other httpd child processes.  All of the httpd
children keep write-pipes open to the rotatelogs process, and write log
data into their pipes.  'rotatelogs' has the job of reading these pipes
and spewing the data into appropriate files, creating new logfiles when
necessary.
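
To make the rotatelogs job concrete, here is a rough Python sketch of
the "read log data, pick the right file, start a new file when needed"
loop.  It is not the real rotatelogs code: the names log_path and
copy_with_rotation are made up for illustration, and the records carry
their own timestamps instead of using the system clock.  The only
detail borrowed from rotatelogs is that each file is named after the
start of its rotation interval.

```python
import os
import tempfile


def log_path(base, now, interval):
    """Name the log file after the start of the interval containing `now`."""
    return f"{base}.{now - now % interval}"


def copy_with_rotation(records, base, interval):
    """Append (timestamp, line) records to per-interval files.

    Returns the list of file paths that were opened, in order.
    """
    paths = []
    current_path, out = None, None
    try:
        for now, line in records:
            path = log_path(base, now, interval)
            if path != current_path:
                # Interval boundary crossed: close the old file, open a new one.
                if out is not None:
                    out.close()
                out = open(path, "a")
                current_path = path
                paths.append(path)
            out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths


if __name__ == "__main__":
    # Hypothetical demo: hourly rotation into a scratch directory.
    base = os.path.join(tempfile.mkdtemp(), "access_log")
    records = [(100, "GET /a\n"), (150, "GET /b\n"), (3700, "GET /c\n")]
    print(copy_with_rotation(records, base, 3600))
```

The point for this bug is that the loop is the only reader: if it
stops running, nothing drains what the httpd children keep writing.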

Here's what Ed Korthof thinks is happening:

  * the svn client (using neon) opens a long-lived connection to do a
    commit.  Using 'keepalive', it sends a huge number of PUT and
    PROPPATCH requests over one connection to a single httpd child.

  * when the 'graceful' signal hits, httpd children wait for their
    current connection to close, then exit.  Meanwhile, the httpd
    parent spawns a new "generation" of httpd children.  Obviously,
    the httpd child servicing the svn commit sticks around a very long
    time, because svn doesn't hang up until it's done sending
    everything.

  * for some reason, the 'rotatelogs' process dies.  It's not clear
    whether it's responding to a signal, or if the httpd parent is
    killing it, or what.  A new 'rotatelogs' takes its place, with new
    httpd children connecting to it.  Meanwhile, the "old" httpd child
    continues to service svn, and continues to write logdata to a dead
    pipe... there's nobody reading data from the pipe on the other end
    anymore!

  * Eventually the pipe fills up, and the httpd child just hangs
    trying to write to it.
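
The last step is easy to see in isolation.  Below is a minimal Python
sketch (Unix only) that writes into a pipe nobody reads and reports how
much the kernel accepts before a write would block.  One assumption is
built in: the read end stays open even though nothing reads from it.
If every read end were really closed, the writer would get EPIPE
instead of hanging, so a hang suggests some process still holds the
read end.  The function name is made up for illustration.

```python
import os


def bytes_until_pipe_blocks(chunk_size=4096):
    """Write into an unread pipe; return bytes accepted before blocking."""
    read_fd, write_fd = os.pipe()
    # Assumption: read_fd stays open but is never read, standing in for
    # a pipe whose reader has gone away while the read end is still held.
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            total += os.write(write_fd, chunk)
    except BlockingIOError:
        # The pipe buffer is full.  A blocking writer, such as an httpd
        # child writing its log line, would hang at this point.
        pass
    finally:
        os.close(write_fd)
        os.close(read_fd)
    return total


if __name__ == "__main__":
    print("pipe accepted", bytes_until_pipe_blocks(), "bytes before blocking")
```

The number printed depends on the platform's pipe buffer size, but it
is always finite, which is why the hang shows up only after a while.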


I think this theory is true, for a few reasons:

  * every time I run 'strace -p PID' on the frozen httpd, it claims to
    be writing logdata.  gdb confirms this as well.

  * edk is able to reproduce the problem without subversion, simply by
    hand-typing HTTP requests chained together by a Keep-Alive header.

  * 'svn import' claims to have received 'success' responses on about
    20 more files beyond what accesslog shows, implying a pipe-backup.

  * The clincher: in all my testing on different platforms (7 or 8
    different setups) this bug is reproducible *every* time httpd.conf
    is using 'rotatelogs', and the bug vanishes when I stop using
    'rotatelogs'.
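
For anyone who wants to try the subversion-free reproduction without
typing requests by hand, here is a rough Python sketch of the same
idea: several requests sent over one connection, each carrying a
Keep-Alive header.  The host, port, and paths are placeholders, both
function names are made up, and the requests are sent back to back
rather than one at a time as a person typing would do.

```python
import socket


def build_keepalive_requests(host, paths):
    """Return raw bytes for several GET requests meant for one connection."""
    requests = []
    for path in paths:
        requests.append(
            f"GET {path} HTTP/1.0\r\n"
            f"Host: {host}\r\n"
            "Connection: Keep-Alive\r\n"
            "\r\n"
        )
    return "".join(requests).encode("ascii")


def send_on_one_connection(host, port, paths, timeout=5.0):
    """Send all requests over a single TCP connection; return the raw reply."""
    payload = build_keepalive_requests(host, paths)
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        try:
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            # The server kept the connection open; stop waiting.
            pass
    return b"".join(chunks)
```

Calling send_on_one_connection("localhost", 80, ["/"] * 50) while
running 'apachectl graceful' would be one way to hold a single httpd
child busy across the restart.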


Final analysis: 

    This looks like some kind of bug in Apache itself, not
    related to SSL or Subversion at all... it looks like a bug in the
    interaction between the 'rotatelogs' process and clients that use
    Keep-Alive.

Any comments or thoughts?

