We have been discovering and fixing places where hostdb was invoking handlers against an HttpSM without holding its lock.
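For context, the heart of those fixes is the standard event-system idiom of taking the target continuation's mutex before calling back into it. The fragment below is only an illustrative sketch of that pattern inside a hostdb handler, not the actual HostDB.cc change; `action` and `r` stand in for the continuation's Action and the HostDBInfo result from the surrounding code, and the retry interval is arbitrary. The real diff is in the commit and PR listed next.

```
// Illustrative sketch (not the actual HostDB.cc diff): before a hostdb
// continuation invokes the HttpSM's handler, it must hold the state
// machine's mutex on the current thread.
MUTEX_TRY_LOCK(lock, action.mutex, this_ethread());
if (!lock.is_locked()) {
  // The caller's lock is busy; retry shortly rather than calling into the
  // HttpSM without holding it.
  this_ethread()->schedule_in(this, HRTIME_MSECONDS(5));
  return EVENT_CONT;
}
if (!action.cancelled) {
  // r is the HostDBInfo result being handed back.
  action.continuation->handleEvent(EVENT_HOST_DB_LOOKUP, r);
}
return EVENT_DONE;
```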
* 55ba409333692db22762f84567f106eac34963e3
* https://github.com/apache/trafficserver/pull/4142

As we have been running this fix in production, our set of crashes has changed. Specifically, we were seeing HOSTDB_LOOKUP events showing up in HttpSM::state_http_server_open and causing an unexpected-event assert. I think those changes just pushed the crashes further down the pipe. Looking more closely at the cores, the HOSTDB_LOOKUP event was being invoked against a stale HttpSM. Looking further down the stack, the hostname that DNS was resolving was different from the server hostname referenced in the HttpSM. So it seems that there was a pending hostdb action that wasn't canceled before the original HttpSM object was destroyed. The other exciting observation was that the cores seemed to happen in groups across machines, implying there were some odd off-box actions affecting the ATS flow.

Here is an example of that stack.

```
#0  0x00002b42cfc661d7 in raise () from /lib64/libc.so.6
#1  0x00002b42cfc678c8 in abort () from /lib64/libc.so.6
#2  0x00002b42cd183277 in ink_abort (message_format=0x2b42cd1ac080 "%s:%d: failed assertion `%s`") at ../../../../trafficserver/lib/ts/ink_error.cc:99
#3  0x00002b42cd1803c7 in _ink_assert (expression=0x864dc7 "0", file=0x864d98 "../../../../trafficserver/proxy/http/HttpSM.cc", line=1941) at ../../../../trafficserver/lib/ts/ink_assert.cc:37
#4  0x0000000000657bf7 in HttpSM::state_http_server_open (this=0x2b4331c64400, event=500, data=0x13ccc00) at ../../../../trafficserver/proxy/http/HttpSM.cc:1941
#5  0x000000000065aec1 in HttpSM::main_handler (this=0x2b4331c64400, event=500, data=0x13ccc00) at ../../../../trafficserver/proxy/http/HttpSM.cc:2802
#6  0x0000000000830ff5 in Continuation::handleEvent (this=0x2b4331c64400, event=500, data=0x13ccc00) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#7  0x000000000075101f in reply_to_cont (cont=0x2b4331c64400, r=0x13ccc00, is_srv=false) at ../../../../trafficserver/iocore/hostdb/HostDB.cc:491
#8  0x0000000000755886 in HostDBContinuation::dnsEvent (this=0x2b42cd0fd9d0, event=600, e=0x2b435c127440) at ../../../../trafficserver/iocore/hostdb/HostDB.cc:1470
#9  0x0000000000830ff5 in Continuation::handleEvent (this=0x2b42cd0fd9d0, event=600, data=0x2b435c127440) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#10 0x00000000007687f6 in DNSEntry::postEvent (this=0x2b42f01bce00) at ../../../../trafficserver/iocore/dns/DNS.cc:1282
#11 0x0000000000830ff5 in Continuation::handleEvent (this=0x2b42f01bce00, event=1, data=0x2b4333faa6e0) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#12 0x000000000083312b in EThread::process_event (this=0x2b42d26bb010, e=0x2b4333faa6e0, calling_code=1) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:132
#13 0x0000000000833302 in EThread::process_queue (this=0x2b42d26bb010, NegativeQueue=0x2b42f54a3e10, ev_count=0x2b42f54a3e08, nq_count=0x2b42f54a3e0c) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:171
#14 0x00000000008335ca in EThread::execute_regular (this=0x2b42d26bb010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:231
#15 0x0000000000833949 in EThread::execute (this=0x2b42d26bb010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#16 0x000000000083257e in spawn_thread_internal (a=0x1247070) at ../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#17 0x00002b42ceff9dc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00002b42cfd2876d in clone () from /lib64/libc.so.6
```
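The stale callback above points at a simple obligation: whatever Action hostdb hands back must be cancelled before the HttpSM is torn down. A minimal sketch of that idiom (illustrative only, not the actual HttpSM teardown code):

```
// pending_action holds the Action returned by an outstanding hostdb/DNS
// request. If it is still set when the state machine is being destroyed it
// must be cancelled; otherwise the lookup can later call back into freed
// memory, which is exactly the stale-HttpSM crash above.
if (pending_action != nullptr) {
  pending_action->cancel(); // tell hostdb/DNS not to call this HttpSM back
  pending_action = nullptr;
}
// ...only after this is it safe to let the HttpSM be freed...
```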
I created another build that added ink_release_assert checks on pending_action in HttpSM::state_http_server_open. The assertion is that pending_action should be nullptr before reassigning it. If you reassign a pending_action that is not null, you lose the reference to that action and you won't be able to cancel it when the HttpSM shuts down. From that build, we were able to get a core with the original HttpSM. Here is an example core.

```
#0  0x00002b4c445cb1d7 in raise () from /lib64/libc.so.6
#1  0x00002b4c445cc8c8 in abort () from /lib64/libc.so.6
#2  0x00002b4c41ae8277 in ink_abort (message_format=0x2b4c41b11080 "%s:%d: failed assertion `%s`") at ../../../../trafficserver/lib/ts/ink_error.cc:99
#3  0x00002b4c41ae53c7 in _ink_assert (expression=0x865780 "event == EVENT_INTERVAL || pending_action == nullptr", file=0x864c38 "../../../../trafficserver/proxy/http/HttpSM.cc", line=1833) at ../../../../trafficserver/lib/ts/ink_assert.cc:37
#4  0x00000000006570c5 in HttpSM::state_http_server_open (this=0x2b4c84697c00, event=3, data=0x2b4e64f22f90) at ../../../../trafficserver/proxy/http/HttpSM.cc:1833
#5  0x000000000065ad79 in HttpSM::main_handler (this=0x2b4c84697c00, event=3, data=0x2b4e64f22f90) at ../../../../trafficserver/proxy/http/HttpSM.cc:2799
#6  0x0000000000830eb9 in Continuation::handleEvent (this=0x2b4c84697c00, event=3, data=0x2b4e64f22f90) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#7  0x000000000080de66 in read_signal_and_update (event=3, vc=0x2b4e64f22e40) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:144
#8  0x000000000080e1e1 in read_signal_done (event=3, nh=0x2b4c47998cf0, vc=0x2b4e64f22e40) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:205
#9  0x000000000080e299 in read_signal_error (nh=0x2b4c47998cf0, vc=0x2b4e64f22e40, lerrno=0) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:229
#10 0x0000000000810eb3 in UnixNetVConnection::readSignalError (this=0x2b4e64f22e40, nh=0x2b4c47998cf0, err=0) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:1090
#11 0x00000000007e8617 in SSLNetVConnection::net_read_io (this=0x2b4e64f22e40, nh=0x2b4c47998cf0, lthread=0x2b4c47995010) at ../../../../trafficserver/iocore/net/SSLNetVConnection.cc:541
#12 0x0000000000805f4f in NetHandler::waitForActivity (this=0x2b4c47998cf0, timeout=60000000) at ../../../../trafficserver/iocore/net/UnixNet.cc:497
#13 0x0000000000833646 in EThread::execute_regular (this=0x2b4c47995010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:273
#14 0x000000000083380d in EThread::execute (this=0x2b4c47995010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#15 0x0000000000832442 in spawn_thread_internal (a=0x120bee0) at ../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#16 0x00002b4c4395edc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00002b4c4468d76d in clone () from /lib64/libc.so.6
```

In this case, the last bit of history on the HttpSM is interesting.

```
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", event = 200, reentrancy = 2},
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", event = 105, reentrancy = 1},
{fileline = 0x868a28 "../../../../trafficserver/proxy/http/HttpSM.cc:7745", event = 65535, reentrancy = 1},
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", event = 3, reentrancy = 1},
```

The first line says that we called state_http_server_open with a NET_EVENT_OPEN event (200); basically, ATS has sent the SYN to the origin.
The second line says that state_http_server_open was called with an inactivity timeout (event 105). The third line shows the default handler being set to HttpSM::state_mark_os_down. But the fourth line is back at state_http_server_open with an EVENT_ERROR (event 3), which is odd because we have already done the error handling for that socket.

Looking into the code, the error case will try additional addresses for the origin. This may just pick up the next address in the cached record, but if we are unlucky, the cached DNS info may have expired in the 30-40 seconds since we started. If we go through the error case again, we may set up two DNS requests for the same state machine. It is quite possible that the state machine will be long gone by the time the second DNS request comes back, and if we have overwritten pending_action at the start of HttpSM::state_http_server_open, there is no way to cancel that request before the state machine goes away.

To address this issue and avoid the double error event, the right thing to do is to call release_server_session() in the error case of HttpSM::state_http_server_open. If the session has read no response, that function will call do_io_close on the server vc (freeing the backing memory) and remove the related server_entry structure and epoll entries (see the sketch at the end of this note).

We have been running this code in one colo since Thursday night and have had no crashes, whereas before we were getting crashes very regularly during high-traffic periods (presumably because of overloaded origins).
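Here is a rough sketch of the shape of those changes, based on the description above rather than the literal diff. The guard matches the assert text in the core above; the CONNECTION_ERROR bookkeeping and the call into HttpTransact are my assumption of the surrounding code, and the other cases and retry details are elided.

```
// Rough sketch (not the literal PR diff): a guard at the top of the handler
// plus teardown of the failed session in the error branch before any retry.
int
HttpSM::state_http_server_open(int event, void * /* data */)
{
  // If pending_action were still set here we would be about to overwrite it
  // and lose the ability to cancel it when the HttpSM shuts down.
  ink_release_assert(event == EVENT_INTERVAL || pending_action == nullptr);

  switch (event) {
  // ... NET_EVENT_OPEN and the other cases elided ...
  case VC_EVENT_ERROR:
  case NET_EVENT_OPEN_FAILED:
    t_state.current.state = HttpTransact::CONNECTION_ERROR;
    // No response has been read yet, so this does do_io_close() on the
    // server VC and drops the server_entry / epoll registration; the dead
    // socket can no longer feed a second error back into this handler.
    release_server_session();
    // The transaction logic then retries other origin addresses or
    // re-resolves as needed; details elided.
    call_transact_and_set_next_state(HttpTransact::HandleResponse);
    return 0;
  }
  return 0;
}
```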
