I noticed a hanging child process on our ASF server aurora.
It currently uses 2.4.11 (plus the post tag commit) and event MPM.
Most processes exiting due to MaxConnectionsPerChild get cleaned up after some time but this one doesn't. It now hangs for more than an hour. I'll let it hang. In case anyone has a good question I can answer with gdb let me know.

It shows a strange connection view in the server status table:

PID     Connections     Threads Async connections
total   accepting       busy    idle    writing keep-alive      closing
93557   1       yes     0       0       0       0       0

So it has 1 connection, but 0s in all other columns.

The connection can be seen by lsof:

 FD     TYPE             DEVICE   SIZE/OFF   NODE NAME
txt VREG 183,3400335528 36497117 275235 /x1/www/archive.apache.org/dist/cordova/cordova-3.4.0-src.zip
  9u    PIPE 0xfffffe061ecfab60      16384        ->0xfffffe061ecfacb8
 10u    PIPE 0xfffffe061ecfacb8          0        ->0xfffffe061ecfab60
 24u  KQUEUE 0xfffffe033071be00                   count=0, state=0x2
41u IPv4 0xfffffe01316243d0 0t0 TCP 127.0.0.1:35849->127.0.0.1:8050 (CLOSE_WAIT) 83u IPv4 0xfffffe0255d08b70 0t0 TCP 127.0.0.1:52023->127.0.0.1:8050 (CLOSE_WAIT) 108u IPv4 0xfffffe09990eeb70 0t0 TCP 127.0.0.1:22532->127.0.0.1:8050 (CLOSE_WAIT)

This is the established connectioN:

110u IPv4 0xfffffe0255ab4b70 0t0 TCP 192.87.106.229:http->179.206.174.192:65496 (ESTABLISHED)

And this is likely the file being served on that connection:

126r VREG 183,3400335528 36497117 275235 /x1/www/archive.apache.org/dist/cordova/cordova-3.4.0-src.zip 156u IPv4 0xfffffe048d0ff3d0 0t0 TCP 127.0.0.1:26685->127.0.0.1:8050 (CLOSE_WAIT) 229u IPv4 0xfffffe0131d013d0 0t0 TCP 127.0.0.1:31538->127.0.0.1:8050 (CLOSE_WAIT)

netstat shows:

Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4 0 87650 192.87.106.229.80 179.206.174.192.65496 ESTABLISHED

so there's 87650 bytes in the send-q. Most lilely the client hans't acked what we send.

The highest thread is in:

#0  0x0000000101a0887c in ?? () from /lib/libthr.so.3
No symbol table info available.
#1  0x0000000101a098c9 in ?? () from /lib/libthr.so.3
No symbol table info available.
#2 0x00000001015d0295 in apr_thread_join (retval=0x7fffffffd52c, thd=0x109410c88) at threadproc/unix/thread.c:217
        stat = <optimized out>
        thread_stat = 0xa
#3 0x0000000108d67a50 in join_workers (listener=0x109410c88, threads=0x10200b400) at event.c:2110
        iter = 0
        i = <optimized out>
        rv = <optimized out>
        thread_rv = 1
#4  0x0000000108d6a324 in child_main (child_num_arg=3) at event.c:2299
        threads = 0x2
        rv = 2
        ts = 0x1020a1fa8
        thread_attr = 0x1020a1fc8
        start_thread_id = 0x1020a1ff8

So it hangs in apr_thread_join(). From another gdb view I can see, that the two top most frames are actually pthread_kill() and the event line number shows, that it hangs while trying to kill the listener.

The listener itself is in:

#0  0x0000000101cfe4bc in kevent () from /lib/libc.so.7
No symbol table info available.
#1 0x00000001015cbe79 in impl_pollset_poll (pollset=0x1020a2dc8, timeout=<optimized out>, num=0x7fffef97cf88, descriptors=0x7fffef97cf68) at poll/unix/kqueue.c:263
        ret = 0
        i = <optimized out>
        j = <optimized out>
        tv = {tv_sec = 0, tv_nsec = 100000000}
        tvptr = 0x7fffef97c640
        rv = <optimized out>
fd = {p = 0x123805bf0, desc_type = 6848688, reqevents = 0, rtnevents = 0, desc = {f = 0x10201d0a0, s = 0x10201d0a0}, client_data = 0x123bd3f80} #2 0x0000000108d68688 in listener_thread (thd=0x109410c88, dummy=<optimized out>) at event.c:1519
        workers_were_busy = 0
        ep = <optimized out>
        rc = 70007
        ti = <optimized out>
        process_slot = 3
        tpool = 0x109411028
        csd = 0x12ff4a0b0
        ptrans = 0x12ff4a028
        lr = <optimized out>
        have_idle_worker = 0
        cs = <optimized out>
        out_pfd = 0x109411048
        num = 0
        timeout_interval = 100000
        timeout_time = 1421704478168661
        now = 1421704478068663
        last_log = 1421697060293018
        pt = <optimized out>
        closed = 1
        listeners_disabled = 0
#3  0x0000000101a010a4 in ?? () from /lib/libthr.so.3

The impl_pollset_poll() has a timeout of 100ms set. Some data for the kevent() args:

(gdb) print *pollset
$1 = {pool = 0x1020a1028, nelts = 0, nalloc = 256, flags = 19, wakeup_pipe = {0x6174736e692d3265, 0x612e7777772f6c6c}, wakeup_pfd = {p = 0x726f2e6568636170, desc_type = 1868771175, reqevents = 26222, rtnevents = 25135, desc = {f = 0x70695f64656e6e61, s = 0x70695f64656e6e61}, client_data = 0x666e6f632e73}, p = 0x1020a2e20,
  provider = 0x1017d9480 <impl>}
(gdb) print *pollset->p
$2 = {kqueue_fd = 24, kevent = {ident = 479, filter = -1, flags = 2, fflags = 0, data = 0, udata = 0x0}, setsize = 512, ke_set = 0x10940c028, result_set = 0x109411028, ring_lock = 0x1020a2e98, query_ring = {next = 0x1020a2e68, prev = 0x1020a2e68}, free_ring = {next = 0x1094c4138, prev = 0x11749c168}, dead_ring = {next = 0x1020a2e88,
    prev = 0x1020a2e88}}
(gdb) print *pollset->p->ke_set
$3 = {ident = 479, filter = -1, flags = 32768, fflags = 0, data = 0, udata = 0x11749c168}

CPU time very slowly increases, so I expect that the timeout for the kqueue fires but the listener thread doesn't exit although listener_may_exit is 1.

Regards,

Rainer

Reply via email to