[ https://issues.apache.org/jira/browse/TS-4475?focusedWorklogId=26695&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-26695 ]
ASF GitHub Bot logged work on TS-4475: -------------------------------------- Author: ASF GitHub Bot Created on: 20/Aug/16 00:09 Start Date: 20/Aug/16 00:09 Worklog Time Spent: 10m Work Description: Github user shinrich commented on the issue: https://github.com/apache/trafficserver/pull/831 Looks good to me. I assume you set the very high timeout because the net connections for the log collation client aren't really like normal connections so the likely lower timeouts shouldn't apply. Is that correct? Issue Time Tracking ------------------- Worklog Id: (was: 26695) Time Spent: 1.5h (was: 1h 20m) > Crash in Log-Collation client after using inactivity-cop. > --------------------------------------------------------- > > Key: TS-4475 > URL: https://issues.apache.org/jira/browse/TS-4475 > Project: Traffic Server > Issue Type: Bug > Components: Logging > Affects Versions: 6.1.1 > Reporter: Peter Chou > Fix For: sometime > > Time Spent: 1.5h > Remaining Estimate: 0h > > Background: We recently tried making use of inactivity-cop by setting it to > 300s instead of the default one-day setting. This was to address an issue > where, under heavy load, ATS would become un-responsive to client requests, > and the condition would persist after traffic was stopped with the active > queue saying 0 connections but 'netstat -na' showing a bunch of established > connections (up to the throttle limit approximately). > Inactivity cop seemed to help ATS handle this situation, but we have since > experienced a couple of core dumps over the last four day period. It seems > occasionally the Log Collation Client State Machine will have event value 105 > or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update() > it tries to call the continuation handler which down the line does not know > about this event thus causing core dump !"unexpcted state" [sic]. > Here is the back-trace -- > (gdb) bt > #0 0x00002b67cd5405f7 in raise () from /lib64/libc.so.6 > #1 0x00002b67cd541e28 in abort () from /lib64/libc.so.6 > #2 0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43 > #3 0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed > assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65 > #4 0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d: > failed assert `%s`") at ink_error.cc:73 > #5 0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted > state\"", file=0x7fb35b "LogCollationClientSM.cc", > line=445) at ink_assert.cc:37 > #6 0x000000000069c86b in LogCollationClientSM::client_idle > (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445 > #7 0x000000000069b427 in LogCollationClientSM::client_handler > (this=0x2b681400bb00, event=105, data=0x2b680c017020) > at LogCollationClientSM.cc:119 > #8 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00, > event=105, data=0x2b680c017020) > at ../iocore/eventsystem/I_Continuation.h:153 > #9 0x0000000000783d40 in read_signal_and_update (event=105, > vc=0x2b680c016f00) at UnixNetVConnection.cc:150 > #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00, > event=1, e=0x127ad60) at UnixNetVConnection.cc:1188 > #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00, > event=1, data=0x127ad60) > at ../iocore/eventsystem/I_Continuation.h:153 > #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00, > event=2, e=0x127ad60) at UnixNet.cc:102 > #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2, > data=0x127ad60) > at ../iocore/eventsystem/I_Continuation.h:153 > #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010, > e=0x127ad60, calling_code=2) at UnixEThread.cc:128 > #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at > UnixEThread.cc:207 > #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918 > I believe it takes a wrong turn here -- > #9 0x0000000000783d40 in read_signal_and_update (event=105, > vc=0x2b680c016f00) at UnixNetVConnection.cc:150 > 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio); > (gdb) list > 145 static inline int > 146 read_signal_and_update(int event, UnixNetVConnection *vc) > 147 { > 148 vc->recursion++; > 149 if (vc->read.vio._cont) { > 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio); > 151 } else { > 152 switch (event) { > 153 case VC_EVENT_EOS: > 154 case VC_EVENT_ERROR: > (gdb) list > 155 case VC_EVENT_ACTIVE_TIMEOUT: > 156 case VC_EVENT_INACTIVITY_TIMEOUT: > 157 Debug("inactivity_cop", "event %d: null read.vio cont, closing > vc %p", event, vc); > 158 vc->closed = 1; > 159 break; > 160 default: > 161 Error("Unexpected event %d for vc %p", event, vc); > 162 ink_release_assert(0); > 163 break; > 164 } > Note: I understand that there were several issues related to TS-3196 > concerning inactivity_cop and this section of code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)