[jira] [Commented] (TS-4475) Crash in Log-Collation client after using inactivity-cop.

Peter Chou (JIRA) Fri, 05 Aug 2016 17:24:56 -0700

    [ 
https://issues.apache.org/jira/browse/TS-4475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410320#comment-15410320
 ]


Peter Chou commented on TS-4475:
--------------------------------

[~oknet] based on input from Susan Hinrich, I just piggy-backed the 
VC_EVENT_ACTIVE_TIMEOUT and VC_EVENT_INACTIVITY_TIMEOUT events with the actions 
taken for EOS and ERROR in the switch statement. I also modified the debug 
message accordingly. This is similar to what we originally had (see initial 
topic comment above) with the addition of the VC_EVENT_ACTIVE_TIMEOUT as both 
you and Susan suggested.

Can you take a look at what I did in client_open() with --
{noformat}
net_vc->set_inactivity_timeout(HRTIME_SECONDS(86400));
{noformat}
I am not sure if this is what you are recommending for step 2. It seems 
sufficient to set the time-out for the net vc to the previous default of 86400. 
I was not able to test this part in my development environment under "master", 
but we'll back-port it to 6.1.x and test in our lab next week. Appreciate if 
you can give an opinion on this whether it looks right and is in line with your 
thinking.

> Crash in Log-Collation client after using inactivity-cop.
> ---------------------------------------------------------
>
>                 Key: TS-4475
>                 URL: https://issues.apache.org/jira/browse/TS-4475
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Logging
>    Affects Versions: 6.1.1
>            Reporter: Peter Chou
>             Fix For: sometime
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Background: We recently tried making use of inactivity-cop by setting it to 
> 300s instead of the default one-day setting. This was to address an issue 
> where, under heavy load, ATS would become un-responsive to client requests, 
> and the condition would persist after traffic was stopped with the active 
> queue saying 0 connections but 'netstat -na' showing a bunch of established 
> connections (up to the throttle limit approximately).
> Inactivity cop seemed to help ATS handle this situation, but we have since 
> experienced a couple of core dumps over the last four day period. It seems 
> occasionally the Log Collation Client State Machine will have event value 105 
> or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update() 
> it tries to call the continuation handler which down the line does not know 
> about this event thus causing core dump !"unexpcted state" [sic].
> Here is the back-trace --
> (gdb) bt
> #0  0x00002b67cd5405f7 in raise () from /lib64/libc.so.6
> #1  0x00002b67cd541e28 in abort () from /lib64/libc.so.6
> #2  0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43
> #3  0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed 
> assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65
> #4  0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d: 
> failed assert `%s`") at ink_error.cc:73
> #5  0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted 
> state\"", file=0x7fb35b "LogCollationClientSM.cc",
>     line=445) at ink_assert.cc:37
> #6  0x000000000069c86b in LogCollationClientSM::client_idle 
> (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445
> #7  0x000000000069b427 in LogCollationClientSM::client_handler 
> (this=0x2b681400bb00, event=105, data=0x2b680c017020)
>     at LogCollationClientSM.cc:119
> #8  0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00, 
> event=105, data=0x2b680c017020)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #9  0x0000000000783d40 in read_signal_and_update (event=105, 
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00, 
> event=1, e=0x127ad60) at UnixNetVConnection.cc:1188
> #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00, 
> event=1, data=0x127ad60)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00, 
> event=2, e=0x127ad60) at UnixNet.cc:102
> #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2, 
> data=0x127ad60)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010, 
> e=0x127ad60, calling_code=2) at UnixEThread.cc:128
> #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at 
> UnixEThread.cc:207
> #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918
> I believe it takes a wrong turn here --
> #9  0x0000000000783d40 in read_signal_and_update (event=105, 
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> 150         vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> (gdb) list
> 145     static inline int
> 146     read_signal_and_update(int event, UnixNetVConnection *vc)
> 147     {
> 148       vc->recursion++;
> 149       if (vc->read.vio._cont) {
> 150         vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> 151       } else {
> 152         switch (event) {
> 153         case VC_EVENT_EOS:
> 154         case VC_EVENT_ERROR:
> (gdb) list
> 155         case VC_EVENT_ACTIVE_TIMEOUT:
> 156         case VC_EVENT_INACTIVITY_TIMEOUT:
> 157           Debug("inactivity_cop", "event %d: null read.vio cont, closing 
> vc %p", event, vc);
> 158           vc->closed = 1;
> 159           break;
> 160         default:
> 161           Error("Unexpected event %d for vc %p", event, vc);
> 162           ink_release_assert(0);
> 163           break;
> 164         }
> Note: I understand that there were several issues related to TS-3196 
> concerning inactivity_cop and this section of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TS-4475) Crash in Log-Collation client after using inactivity-cop.

Reply via email to