[
https://issues.apache.org/jira/browse/TS-4475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410320#comment-15410320
]
Peter Chou commented on TS-4475:
--------------------------------
[~oknet] based on input from Susan Hinrich, I just piggy-backed the
VC_EVENT_ACTIVE_TIMEOUT and VC_EVENT_INACTIVITY_TIMEOUT events with the actions
taken for EOS and ERROR in the switch statement. I also modified the debug
message accordingly. This is similar to what we originally had (see initial
topic comment above) with the addition of the VC_EVENT_ACTIVE_TIMEOUT as both
you and Susan suggested.
Can you take a look at what I did in client_open() with --
{noformat}
net_vc->set_inactivity_timeout(HRTIME_SECONDS(86400));
{noformat}
I am not sure if this is what you are recommending for step 2. It seems
sufficient to set the time-out for the net vc to the previous default of 86400.
I was not able to test this part in my development environment under "master",
but we'll back-port it to 6.1.x and test in our lab next week. Appreciate if
you can give an opinion on this whether it looks right and is in line with your
thinking.
> Crash in Log-Collation client after using inactivity-cop.
> ---------------------------------------------------------
>
> Key: TS-4475
> URL: https://issues.apache.org/jira/browse/TS-4475
> Project: Traffic Server
> Issue Type: Bug
> Components: Logging
> Affects Versions: 6.1.1
> Reporter: Peter Chou
> Fix For: sometime
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Background: We recently tried making use of inactivity-cop by setting it to
> 300s instead of the default one-day setting. This was to address an issue
> where, under heavy load, ATS would become un-responsive to client requests,
> and the condition would persist after traffic was stopped with the active
> queue saying 0 connections but 'netstat -na' showing a bunch of established
> connections (up to the throttle limit approximately).
> Inactivity cop seemed to help ATS handle this situation, but we have since
> experienced a couple of core dumps over the last four day period. It seems
> occasionally the Log Collation Client State Machine will have event value 105
> or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update()
> it tries to call the continuation handler which down the line does not know
> about this event thus causing core dump !"unexpcted state" [sic].
> Here is the back-trace --
> (gdb) bt
> #0 0x00002b67cd5405f7 in raise () from /lib64/libc.so.6
> #1 0x00002b67cd541e28 in abort () from /lib64/libc.so.6
> #2 0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43
> #3 0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed
> assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65
> #4 0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d:
> failed assert `%s`") at ink_error.cc:73
> #5 0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted
> state\"", file=0x7fb35b "LogCollationClientSM.cc",
> line=445) at ink_assert.cc:37
> #6 0x000000000069c86b in LogCollationClientSM::client_idle
> (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445
> #7 0x000000000069b427 in LogCollationClientSM::client_handler
> (this=0x2b681400bb00, event=105, data=0x2b680c017020)
> at LogCollationClientSM.cc:119
> #8 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00,
> event=105, data=0x2b680c017020)
> at ../iocore/eventsystem/I_Continuation.h:153
> #9 0x0000000000783d40 in read_signal_and_update (event=105,
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00,
> event=1, e=0x127ad60) at UnixNetVConnection.cc:1188
> #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00,
> event=1, data=0x127ad60)
> at ../iocore/eventsystem/I_Continuation.h:153
> #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00,
> event=2, e=0x127ad60) at UnixNet.cc:102
> #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2,
> data=0x127ad60)
> at ../iocore/eventsystem/I_Continuation.h:153
> #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010,
> e=0x127ad60, calling_code=2) at UnixEThread.cc:128
> #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at
> UnixEThread.cc:207
> #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918
> I believe it takes a wrong turn here --
> #9 0x0000000000783d40 in read_signal_and_update (event=105,
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> (gdb) list
> 145 static inline int
> 146 read_signal_and_update(int event, UnixNetVConnection *vc)
> 147 {
> 148 vc->recursion++;
> 149 if (vc->read.vio._cont) {
> 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> 151 } else {
> 152 switch (event) {
> 153 case VC_EVENT_EOS:
> 154 case VC_EVENT_ERROR:
> (gdb) list
> 155 case VC_EVENT_ACTIVE_TIMEOUT:
> 156 case VC_EVENT_INACTIVITY_TIMEOUT:
> 157 Debug("inactivity_cop", "event %d: null read.vio cont, closing
> vc %p", event, vc);
> 158 vc->closed = 1;
> 159 break;
> 160 default:
> 161 Error("Unexpected event %d for vc %p", event, vc);
> 162 ink_release_assert(0);
> 163 break;
> 164 }
> Note: I understand that there were several issues related to TS-3196
> concerning inactivity_cop and this section of code.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)