I noticed a hanging child process on our ASF server aurora.
It currently runs 2.4.11 (plus the post-tag commit) with the event MPM.
Most processes exiting due to MaxConnectionsPerChild get cleaned up
after some time, but this one doesn't; it has now been hanging for more
than an hour. I'll let it hang, so in case anyone has a good question I
can answer with gdb, let me know.
It shows a strange connection view in the server status table:
PID    Connections        Threads      Async connections
       total  accepting   busy  idle   writing  keep-alive  closing
93557  1      yes         0     0      0        0           0
So it has 1 connection, but 0s in all other columns.
The connection can be seen by lsof:
FD TYPE DEVICE SIZE/OFF NODE NAME
txt VREG 183,3400335528 36497117 275235
/x1/www/archive.apache.org/dist/cordova/cordova-3.4.0-src.zip
9u PIPE 0xfffffe061ecfab60 16384 ->0xfffffe061ecfacb8
10u PIPE 0xfffffe061ecfacb8 0 ->0xfffffe061ecfab60
24u KQUEUE 0xfffffe033071be00 count=0, state=0x2
41u IPv4 0xfffffe01316243d0 0t0 TCP
127.0.0.1:35849->127.0.0.1:8050 (CLOSE_WAIT)
83u IPv4 0xfffffe0255d08b70 0t0 TCP
127.0.0.1:52023->127.0.0.1:8050 (CLOSE_WAIT)
108u IPv4 0xfffffe09990eeb70 0t0 TCP
127.0.0.1:22532->127.0.0.1:8050 (CLOSE_WAIT)
This is the established connection:
110u IPv4 0xfffffe0255ab4b70 0t0 TCP
192.87.106.229:http->179.206.174.192:65496 (ESTABLISHED)
And this is likely the file being served on that connection:
126r VREG 183,3400335528 36497117 275235
/x1/www/archive.apache.org/dist/cordova/cordova-3.4.0-src.zip
156u IPv4 0xfffffe048d0ff3d0 0t0 TCP
127.0.0.1:26685->127.0.0.1:8050 (CLOSE_WAIT)
229u IPv4 0xfffffe0131d013d0 0t0 TCP
127.0.0.1:31538->127.0.0.1:8050 (CLOSE_WAIT)
netstat shows:
Proto  Recv-Q  Send-Q  Local Address        Foreign Address          (state)
tcp4        0   87650  192.87.106.229.80    179.206.174.192.65496    ESTABLISHED
so there are 87650 bytes in the Send-Q. Most likely the client hasn't
acked what we sent.
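
To make explicit what the full Send-Q means for the writing side, here
is a self-contained sketch (not httpd code; the loopback setup and
chunk size are made up for the illustration): once the peer stops
reading/acking, non-blocking writes return EWOULDBLOCK and the
remaining bytes just sit in the kernel send buffer, which is the state
this connection appears to be stuck in:

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in sa;
    socklen_t salen = sizeof(sa);
    char chunk[8192];
    size_t queued = 0;
    ssize_t n;
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    int csock, ssock;

    /* Loopback listener on a kernel-chosen port, plus a "client" that
     * connects and then never reads. */
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(lsock, (struct sockaddr *)&sa, sizeof(sa));
    listen(lsock, 1);
    getsockname(lsock, (struct sockaddr *)&sa, &salen);
    csock = socket(AF_INET, SOCK_STREAM, 0);
    connect(csock, (struct sockaddr *)&sa, sizeof(sa));
    ssock = accept(lsock, NULL, NULL);

    /* The "server" side writes non-blocking until both the peer's
     * receive window and our own send buffer are full; whatever is
     * still queued locally is what netstat reports as Send-Q. */
    fcntl(ssock, F_SETFL, O_NONBLOCK);
    memset(chunk, 'x', sizeof(chunk));
    while ((n = send(ssock, chunk, sizeof(chunk), 0)) > 0)
        queued += (size_t)n;
    if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
        printf("%zu bytes buffered, further writes would block\n", queued);

    close(ssock); close(csock); close(lsock);
    return 0;
}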
The highest-numbered thread (the one running child_main()) is in:
#0 0x0000000101a0887c in ?? () from /lib/libthr.so.3
No symbol table info available.
#1 0x0000000101a098c9 in ?? () from /lib/libthr.so.3
No symbol table info available.
#2 0x00000001015d0295 in apr_thread_join (retval=0x7fffffffd52c,
thd=0x109410c88) at threadproc/unix/thread.c:217
stat = <optimized out>
thread_stat = 0xa
#3 0x0000000108d67a50 in join_workers (listener=0x109410c88,
threads=0x10200b400) at event.c:2110
iter = 0
i = <optimized out>
rv = <optimized out>
thread_rv = 1
#4 0x0000000108d6a324 in child_main (child_num_arg=3) at event.c:2299
threads = 0x2
rv = 2
ts = 0x1020a1fa8
thread_attr = 0x1020a1fc8
start_thread_id = 0x1020a1ff8
So it hangs in apr_thread_join(). From another gdb view I can see that
the two topmost frames are actually in pthread_kill(), and the event.c
line number shows that it hangs while trying to kill the listener.
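
For readers who haven't looked at that code path: the pattern in play
is roughly the one sketched below with plain pthreads (not the actual
event.c/APR code; the flag and signal names are invented for the
example). The joining thread sets an exit flag, signals the listener to
knock it out of its blocking syscall, and then blocks in pthread_join()
until the listener returns; that join is where this child is sitting.

#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

#define WAKEUP_SIG SIGHUP            /* stand-in for the real wakeup signal */

static volatile sig_atomic_t may_exit = 0;

static void wakeup_handler(int sig)
{
    (void)sig;                       /* only needed to interrupt poll() */
}

static void *listener(void *arg)
{
    (void)arg;
    /* Block with a short timeout and re-check the exit flag each time
     * around, so either the signal or the timeout lets us notice it. */
    while (!may_exit) {
        struct pollfd unused;
        poll(&unused, 0, 100);       /* 100ms, like the kevent timeout */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    struct sigaction sa;

    sa.sa_handler = wakeup_handler;
    sa.sa_flags = 0;                 /* no SA_RESTART: let poll() return EINTR */
    sigemptyset(&sa.sa_mask);
    sigaction(WAKEUP_SIG, &sa, NULL);

    pthread_create(&tid, NULL, listener, NULL);
    sleep(1);                        /* let the listener run for a bit */

    /* Shutdown: set the flag, poke the listener, then join.  If the
     * listener thread never returns from its loop, the joiner hangs
     * right here, which is what the backtrace above shows. */
    may_exit = 1;
    pthread_kill(tid, WAKEUP_SIG);
    pthread_join(tid, NULL);
    puts("listener joined");
    return 0;
}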
The listener itself is in:
#0 0x0000000101cfe4bc in kevent () from /lib/libc.so.7
No symbol table info available.
#1 0x00000001015cbe79 in impl_pollset_poll (pollset=0x1020a2dc8,
timeout=<optimized out>, num=0x7fffef97cf88, descriptors=0x7fffef97cf68)
at poll/unix/kqueue.c:263
ret = 0
i = <optimized out>
j = <optimized out>
tv = {tv_sec = 0, tv_nsec = 100000000}
tvptr = 0x7fffef97c640
rv = <optimized out>
fd = {p = 0x123805bf0, desc_type = 6848688, reqevents = 0,
rtnevents = 0, desc = {f = 0x10201d0a0, s = 0x10201d0a0}, client_data =
0x123bd3f80}
#2 0x0000000108d68688 in listener_thread (thd=0x109410c88,
dummy=<optimized out>) at event.c:1519
workers_were_busy = 0
ep = <optimized out>
rc = 70007
ti = <optimized out>
process_slot = 3
tpool = 0x109411028
csd = 0x12ff4a0b0
ptrans = 0x12ff4a028
lr = <optimized out>
have_idle_worker = 0
cs = <optimized out>
out_pfd = 0x109411048
num = 0
timeout_interval = 100000
timeout_time = 1421704478168661
now = 1421704478068663
last_log = 1421697060293018
pt = <optimized out>
closed = 1
listeners_disabled = 0
#3 0x0000000101a010a4 in ?? () from /lib/libthr.so.3
impl_pollset_poll() is called with a timeout of 100ms. Some data for
the kevent() args:
(gdb) print *pollset
$1 = {pool = 0x1020a1028, nelts = 0, nalloc = 256, flags = 19,
wakeup_pipe = {0x6174736e692d3265, 0x612e7777772f6c6c}, wakeup_pfd = {p
= 0x726f2e6568636170,
desc_type = 1868771175, reqevents = 26222, rtnevents = 25135, desc
= {f = 0x70695f64656e6e61, s = 0x70695f64656e6e61}, client_data =
0x666e6f632e73}, p = 0x1020a2e20,
provider = 0x1017d9480 <impl>}
(gdb) print *pollset->p
$2 = {kqueue_fd = 24, kevent = {ident = 479, filter = -1, flags = 2,
fflags = 0, data = 0, udata = 0x0}, setsize = 512, ke_set = 0x10940c028,
result_set = 0x109411028,
ring_lock = 0x1020a2e98, query_ring = {next = 0x1020a2e68, prev =
0x1020a2e68}, free_ring = {next = 0x1094c4138, prev = 0x11749c168},
dead_ring = {next = 0x1020a2e88,
prev = 0x1020a2e88}}
(gdb) print *pollset->p->ke_set
$3 = {ident = 479, filter = -1, flags = 32768, fflags = 0, data = 0,
udata = 0x11749c168}
CPU time increases very slowly, so I expect that the kqueue timeout
fires, but the listener thread doesn't exit although listener_may_exit
is 1.
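
For completeness, the kevent() call shape implied by the data above
(tv = {0, 100000000}, i.e. 100ms) is sketched below; this is not the
apr poll/unix/kqueue.c code, just a standalone illustration. With
nothing ready, kevent() returns 0 when the timeout expires, which APR
reports as a timeout status; if I read the apr error codes correctly,
the rc = 70007 in frame #2 above is exactly APR_TIMEUP, so the timeouts
do fire and the loop keeps coming back around.

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int kq = kqueue();
    struct kevent ready[16];
    struct timespec tv = { 0, 100000000 };   /* 100ms, as in frame #1 above */

    /* No filters registered here, so this always times out after 100ms
     * and returns 0; in the server the equivalent return value just
     * sends the listener loop around once more to re-check its exit
     * conditions. */
    int n = kevent(kq, NULL, 0, ready, 16, &tv);
    printf("kevent returned %d (%s)\n", n, n == 0 ? "timeout" : "events");

    close(kq);
    return 0;
}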
Regards,
Rainer