We have been discovering and fixing places where hostdb was invoking handlers 
against HttpSM without a lock.

* 55ba409333692db22762f84567f106eac34963e3
* https://github.com/apache/trafficserver/pull/4142

As we have been running this fix in production, our set of crashes has changed. 
 Specifically we were seeing HOSTDB_LOOKUP events showing up in 
HttpSM::state_http_server_open and causing an unexpected event assert.  I think 
these changes were pushing cache changes down the pipe.  Looking more closely 
at the cores, the HOSTDB_LOOKUP event was being invoked against a stale HttpSM. 
 Looking further down the stack, the hostname that DNS was using was different 
that the server hostname referenced in HttpSM.  So it seems that there was a 
pending hostdb action that wasn't canceled before the origin HttpSM object was 
destroyed.  The other exciting observation was that the cores seemed to happen 
in groups across machines implying there was some odd off box actions affecting 
the ATS flow.

Here is an example of that stack.
```
#0  0x00002b42cfc661d7 in raise () from /lib64/libc.so.6
#1  0x00002b42cfc678c8 in abort () from /lib64/libc.so.6
#2  0x00002b42cd183277 in ink_abort (message_format=0x2b42cd1ac080 "%s:%d: 
failed assertion `%s`") at ../../../../trafficserver/lib/ts/ink_error.cc:99
#3  0x00002b42cd1803c7 in _ink_assert (expression=0x864dc7 "0", file=0x864d98 
"../../../../trafficserver/proxy/http/HttpSM.cc", line=1941) at 
../../../../trafficserver/lib/ts/ink_assert.cc:37
#4  0x0000000000657bf7 in HttpSM::state_http_server_open (this=0x2b4331c64400, 
event=500, data=0x13ccc00) at 
../../../../trafficserver/proxy/http/HttpSM.cc:1941
#5  0x000000000065aec1 in HttpSM::main_handler (this=0x2b4331c64400, event=500, 
data=0x13ccc00) at ../../../../trafficserver/proxy/http/HttpSM.cc:2802
#6  0x0000000000830ff5 in Continuation::handleEvent (this=0x2b4331c64400, 
event=500, data=0x13ccc00) at 
../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#7  0x000000000075101f in reply_to_cont (cont=0x2b4331c64400, r=0x13ccc00, 
is_srv=false) at ../../../../trafficserver/iocore/hostdb/HostDB.cc:491
#8  0x0000000000755886 in HostDBContinuation::dnsEvent (this=0x2b42cd0fd9d0, 
event=600, e=0x2b435c127440) at 
../../../../trafficserver/iocore/hostdb/HostDB.cc:1470
#9  0x0000000000830ff5 in Continuation::handleEvent (this=0x2b42cd0fd9d0, 
event=600, data=0x2b435c127440) at 
../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#10 0x00000000007687f6 in DNSEntry::postEvent (this=0x2b42f01bce00) at 
../../../../trafficserver/iocore/dns/DNS.cc:1282
#11 0x0000000000830ff5 in Continuation::handleEvent (this=0x2b42f01bce00, 
event=1, data=0x2b4333faa6e0) at 
../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#12 0x000000000083312b in EThread::process_event (this=0x2b42d26bb010, 
e=0x2b4333faa6e0, calling_code=1) at 
../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:132
#13 0x0000000000833302 in EThread::process_queue (this=0x2b42d26bb010, 
NegativeQueue=0x2b42f54a3e10, ev_count=0x2b42f54a3e08, nq_count=0x2b42f54a3e0c) 
at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:171
#14 0x00000000008335ca in EThread::execute_regular (this=0x2b42d26bb010) at 
../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:231
#15 0x0000000000833949 in EThread::execute (this=0x2b42d26bb010) at 
../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#16 0x000000000083257e in spawn_thread_internal (a=0x1247070) at 
../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#17 0x00002b42ceff9dc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00002b42cfd2876d in clone () from /lib64/libc.so.6
```

I created another build that added ink_release asserts on pending_action in 
HttpSM::state_http_server_open.  The assertion is that pending_action should be 
nullptr before reassigning it.  If you reassign a pending_action that is not 
null, you lose the reference to that action and you won't be able to cancel 
that action when HttpSM shuts down.  From that build, we were able to get core 
with the original HttpSM.   Here is an example core.

```
#0  0x00002b4c445cb1d7 in raise () from /lib64/libc.so.6
#1  0x00002b4c445cc8c8 in abort () from /lib64/libc.so.6
#2  0x00002b4c41ae8277 in ink_abort (message_format=0x2b4c41b11080 "%s:%d: 
failed assertion `%s`") at ../../../../trafficserver/lib/ts/ink_error.cc:99
#3  0x00002b4c41ae53c7 in _ink_assert (expression=0x865780 "event == 
EVENT_INTERVAL || pending_action == nullptr", file=0x864c38 
"../../../../trafficserver/proxy/http/HttpSM.cc", line=1833)
    at ../../../../trafficserver/lib/ts/ink_assert.cc:37
#4  0x00000000006570c5 in HttpSM::state_http_server_open (this=0x2b4c84697c00, 
event=3, data=0x2b4e64f22f90) at 
../../../../trafficserver/proxy/http/HttpSM.cc:1833
#5  0x000000000065ad79 in HttpSM::main_handler (this=0x2b4c84697c00, event=3, 
data=0x2b4e64f22f90) at ../../../../trafficserver/proxy/http/HttpSM.cc:2799
#6  0x0000000000830eb9 in Continuation::handleEvent (this=0x2b4c84697c00, 
event=3, data=0x2b4e64f22f90) at 
../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#7  0x000000000080de66 in read_signal_and_update (event=3, vc=0x2b4e64f22e40) 
at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:144
#8  0x000000000080e1e1 in read_signal_done (event=3, nh=0x2b4c47998cf0, 
vc=0x2b4e64f22e40) at 
../../../../trafficserver/iocore/net/UnixNetVConnection.cc:205
#9  0x000000000080e299 in read_signal_error (nh=0x2b4c47998cf0, 
vc=0x2b4e64f22e40, lerrno=0) at 
../../../../trafficserver/iocore/net/UnixNetVConnection.cc:229
#10 0x0000000000810eb3 in UnixNetVConnection::readSignalError 
(this=0x2b4e64f22e40, nh=0x2b4c47998cf0, err=0) at 
../../../../trafficserver/iocore/net/UnixNetVConnection.cc:1090
#11 0x00000000007e8617 in SSLNetVConnection::net_read_io (this=0x2b4e64f22e40, 
nh=0x2b4c47998cf0, lthread=0x2b4c47995010) at 
../../../../trafficserver/iocore/net/SSLNetVConnection.cc:541
#12 0x0000000000805f4f in NetHandler::waitForActivity (this=0x2b4c47998cf0, 
timeout=60000000) at ../../../../trafficserver/iocore/net/UnixNet.cc:497
#13 0x0000000000833646 in EThread::execute_regular (this=0x2b4c47995010) at 
../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:273
#14 0x000000000083380d in EThread::execute (this=0x2b4c47995010) at 
../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#15 0x0000000000832442 in spawn_thread_internal (a=0x120bee0) at 
../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#16 0x00002b4c4395edc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00002b4c4468d76d in clone () from /lib64/libc.so.6
```
In this case, the last bit of history on the HttpSM is interesting.
```
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", 
event = 200, reentrancy = 2}, 
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", 
event = 105, reentrancy = 1}, 
{fileline = 0x868a28 "../../../../trafficserver/proxy/http/HttpSM.cc:7745", 
event = 65535, reentrancy = 1},
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", 
event = 3, reentrancy = 1},
```
The first line says that we called state_http_server_open with a NET_OPEN 
event. Basically ATS has sent the SYN to the origin.

The second line says that state_http_server_open was called with an inactivity 
timeout error.

The third line is setting the default handler to HttpSM::state_mark_os_down.

But the first line is back at state_http_server_open with a EVENT_ERROR. Which 
is odd because we have already done the error stuff for that socket.

Looking into the code, the error case will try additional addresses for the 
origin. This may just pick up the next address in the cached record, but if we 
are unlucky the cached DNS info may have expired in the 30-40 seconds since we 
started. If we go through the error case again, we may set up two dns requests 
for the same state machine. It is quite possible that the state machine will be 
long gone by the time the second DNS request goes off. If we have overwritten 
the pending_action at the start of HttpSM::state_http_server_open, there is no 
way to cancel the event before the state machine goes away.

To address this issue and avoid the double error event,  the right thing to do 
is call release_server_session() in the error case of 
HttpSM::state_http_server_open. If it session has read no response, that 
function will call do_io_close on the server vc (freeing the backing memory) 
and remove the related server_entry structure and epoll entries.

We have been running this code in one colo since Thursday night and have had no 
crashes.  Where before we were getting crashes very regularly during high 
traffic time (presumably because of overloaded origins).


[ Full content available at: https://github.com/apache/trafficserver/pull/4166 ]
This message was relayed via gitbox.apache.org for [email protected]

Reply via email to