[
https://issues.apache.org/jira/browse/TS-4475?focusedWorklogId=28427&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-28427
]
ASF GitHub Bot logged work on TS-4475:
--------------------------------------
Author: ASF GitHub Bot
Created on: 08/Sep/16 12:55
Start Date: 08/Sep/16 12:55
Worklog Time Spent: 10m
Work Description: Github user oknet commented on the issue:
https://github.com/apache/trafficserver/pull/831
backport +1
Issue Time Tracking
-------------------
Worklog Id: (was: 28427)
Time Spent: 2.5h (was: 2h 20m)
> Crash in Log-Collation client after using inactivity-cop.
> ---------------------------------------------------------
>
> Key: TS-4475
> URL: https://issues.apache.org/jira/browse/TS-4475
> Project: Traffic Server
> Issue Type: Bug
> Components: Logging
> Affects Versions: 6.1.1
> Reporter: Peter Chou
> Fix For: sometime
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> Background: We recently tried making use of inactivity-cop by setting it to
> 300s instead of the default one-day setting. This was to address an issue
> where, under heavy load, ATS would become un-responsive to client requests,
> and the condition would persist after traffic was stopped with the active
> queue saying 0 connections but 'netstat -na' showing a bunch of established
> connections (up to the throttle limit approximately).
> Inactivity cop seemed to help ATS handle this situation, but we have since
> experienced a couple of core dumps over the last four day period. It seems
> occasionally the Log Collation Client State Machine will have event value 105
> or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update()
> it tries to call the continuation handler which down the line does not know
> about this event thus causing core dump !"unexpcted state" [sic].
> Here is the back-trace --
> (gdb) bt
> #0 0x00002b67cd5405f7 in raise () from /lib64/libc.so.6
> #1 0x00002b67cd541e28 in abort () from /lib64/libc.so.6
> #2 0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43
> #3 0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed
> assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65
> #4 0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d:
> failed assert `%s`") at ink_error.cc:73
> #5 0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted
> state\"", file=0x7fb35b "LogCollationClientSM.cc",
> line=445) at ink_assert.cc:37
> #6 0x000000000069c86b in LogCollationClientSM::client_idle
> (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445
> #7 0x000000000069b427 in LogCollationClientSM::client_handler
> (this=0x2b681400bb00, event=105, data=0x2b680c017020)
> at LogCollationClientSM.cc:119
> #8 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00,
> event=105, data=0x2b680c017020)
> at ../iocore/eventsystem/I_Continuation.h:153
> #9 0x0000000000783d40 in read_signal_and_update (event=105,
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00,
> event=1, e=0x127ad60) at UnixNetVConnection.cc:1188
> #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00,
> event=1, data=0x127ad60)
> at ../iocore/eventsystem/I_Continuation.h:153
> #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00,
> event=2, e=0x127ad60) at UnixNet.cc:102
> #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2,
> data=0x127ad60)
> at ../iocore/eventsystem/I_Continuation.h:153
> #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010,
> e=0x127ad60, calling_code=2) at UnixEThread.cc:128
> #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at
> UnixEThread.cc:207
> #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918
> I believe it takes a wrong turn here --
> #9 0x0000000000783d40 in read_signal_and_update (event=105,
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> (gdb) list
> 145 static inline int
> 146 read_signal_and_update(int event, UnixNetVConnection *vc)
> 147 {
> 148 vc->recursion++;
> 149 if (vc->read.vio._cont) {
> 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> 151 } else {
> 152 switch (event) {
> 153 case VC_EVENT_EOS:
> 154 case VC_EVENT_ERROR:
> (gdb) list
> 155 case VC_EVENT_ACTIVE_TIMEOUT:
> 156 case VC_EVENT_INACTIVITY_TIMEOUT:
> 157 Debug("inactivity_cop", "event %d: null read.vio cont, closing
> vc %p", event, vc);
> 158 vc->closed = 1;
> 159 break;
> 160 default:
> 161 Error("Unexpected event %d for vc %p", event, vc);
> 162 ink_release_assert(0);
> 163 break;
> 164 }
> Note: I understand that there were several issues related to TS-3196
> concerning inactivity_cop and this section of code.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)