[ 
https://issues.apache.org/jira/browse/TS-4475?focusedWorklogId=26695&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-26695
 ]

ASF GitHub Bot logged work on TS-4475:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Aug/16 00:09
            Start Date: 20/Aug/16 00:09
    Worklog Time Spent: 10m 
      Work Description: Github user shinrich commented on the issue:

    https://github.com/apache/trafficserver/pull/831
  
    Looks good to me.  I assume you set the very high timeout because the net 
connections for the log collation client aren't really like normal connections 
so the likely lower timeouts shouldn't apply.  Is that correct?


Issue Time Tracking
-------------------

    Worklog Id:     (was: 26695)
    Time Spent: 1.5h  (was: 1h 20m)

> Crash in Log-Collation client after using inactivity-cop.
> ---------------------------------------------------------
>
>                 Key: TS-4475
>                 URL: https://issues.apache.org/jira/browse/TS-4475
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Logging
>    Affects Versions: 6.1.1
>            Reporter: Peter Chou
>             Fix For: sometime
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Background: We recently tried making use of inactivity-cop by setting it to 
> 300s instead of the default one-day setting. This was to address an issue 
> where, under heavy load, ATS would become un-responsive to client requests, 
> and the condition would persist after traffic was stopped with the active 
> queue saying 0 connections but 'netstat -na' showing a bunch of established 
> connections (up to the throttle limit approximately).
> Inactivity cop seemed to help ATS handle this situation, but we have since 
> experienced a couple of core dumps over the last four day period. It seems 
> occasionally the Log Collation Client State Machine will have event value 105 
> or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update() 
> it tries to call the continuation handler which down the line does not know 
> about this event thus causing core dump !"unexpcted state" [sic].
> Here is the back-trace --
> (gdb) bt
> #0  0x00002b67cd5405f7 in raise () from /lib64/libc.so.6
> #1  0x00002b67cd541e28 in abort () from /lib64/libc.so.6
> #2  0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43
> #3  0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed 
> assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65
> #4  0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d: 
> failed assert `%s`") at ink_error.cc:73
> #5  0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted 
> state\"", file=0x7fb35b "LogCollationClientSM.cc",
>     line=445) at ink_assert.cc:37
> #6  0x000000000069c86b in LogCollationClientSM::client_idle 
> (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445
> #7  0x000000000069b427 in LogCollationClientSM::client_handler 
> (this=0x2b681400bb00, event=105, data=0x2b680c017020)
>     at LogCollationClientSM.cc:119
> #8  0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00, 
> event=105, data=0x2b680c017020)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #9  0x0000000000783d40 in read_signal_and_update (event=105, 
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00, 
> event=1, e=0x127ad60) at UnixNetVConnection.cc:1188
> #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00, 
> event=1, data=0x127ad60)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00, 
> event=2, e=0x127ad60) at UnixNet.cc:102
> #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2, 
> data=0x127ad60)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010, 
> e=0x127ad60, calling_code=2) at UnixEThread.cc:128
> #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at 
> UnixEThread.cc:207
> #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918
> I believe it takes a wrong turn here --
> #9  0x0000000000783d40 in read_signal_and_update (event=105, 
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> 150         vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> (gdb) list
> 145     static inline int
> 146     read_signal_and_update(int event, UnixNetVConnection *vc)
> 147     {
> 148       vc->recursion++;
> 149       if (vc->read.vio._cont) {
> 150         vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> 151       } else {
> 152         switch (event) {
> 153         case VC_EVENT_EOS:
> 154         case VC_EVENT_ERROR:
> (gdb) list
> 155         case VC_EVENT_ACTIVE_TIMEOUT:
> 156         case VC_EVENT_INACTIVITY_TIMEOUT:
> 157           Debug("inactivity_cop", "event %d: null read.vio cont, closing 
> vc %p", event, vc);
> 158           vc->closed = 1;
> 159           break;
> 160         default:
> 161           Error("Unexpected event %d for vc %p", event, vc);
> 162           ink_release_assert(0);
> 163           break;
> 164         }
> Note: I understand that there were several issues related to TS-3196 
> concerning inactivity_cop and this section of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to