[ 
https://issues.apache.org/jira/browse/TS-4475?focusedWorklogId=28720&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-28720
 ]

ASF GitHub Bot logged work on TS-4475:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/Sep/16 12:36
            Start Date: 10/Sep/16 12:36
    Worklog Time Spent: 10m 
      Work Description: Github user shinrich closed the pull request at:

    https://github.com/apache/trafficserver/pull/831


Issue Time Tracking
-------------------

    Worklog Id:     (was: 28720)
    Time Spent: 3h  (was: 2h 50m)

> Crash in Log-Collation client after using inactivity-cop.
> ---------------------------------------------------------
>
>                 Key: TS-4475
>                 URL: https://issues.apache.org/jira/browse/TS-4475
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Logging
>    Affects Versions: 6.1.1
>            Reporter: Peter Chou
>             Fix For: sometime
>
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> Background: We recently tried making use of inactivity-cop by setting it to 
> 300s instead of the default one-day setting. This was to address an issue 
> where, under heavy load, ATS would become un-responsive to client requests, 
> and the condition would persist after traffic was stopped with the active 
> queue saying 0 connections but 'netstat -na' showing a bunch of established 
> connections (up to the throttle limit approximately).
> Inactivity cop seemed to help ATS handle this situation, but we have since 
> experienced a couple of core dumps over the last four day period. It seems 
> occasionally the Log Collation Client State Machine will have event value 105 
> or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update() 
> it tries to call the continuation handler which down the line does not know 
> about this event thus causing core dump !"unexpcted state" [sic].
> Here is the back-trace --
> (gdb) bt
> #0  0x00002b67cd5405f7 in raise () from /lib64/libc.so.6
> #1  0x00002b67cd541e28 in abort () from /lib64/libc.so.6
> #2  0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43
> #3  0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed 
> assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65
> #4  0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d: 
> failed assert `%s`") at ink_error.cc:73
> #5  0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted 
> state\"", file=0x7fb35b "LogCollationClientSM.cc",
>     line=445) at ink_assert.cc:37
> #6  0x000000000069c86b in LogCollationClientSM::client_idle 
> (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445
> #7  0x000000000069b427 in LogCollationClientSM::client_handler 
> (this=0x2b681400bb00, event=105, data=0x2b680c017020)
>     at LogCollationClientSM.cc:119
> #8  0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00, 
> event=105, data=0x2b680c017020)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #9  0x0000000000783d40 in read_signal_and_update (event=105, 
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00, 
> event=1, e=0x127ad60) at UnixNetVConnection.cc:1188
> #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00, 
> event=1, data=0x127ad60)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00, 
> event=2, e=0x127ad60) at UnixNet.cc:102
> #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2, 
> data=0x127ad60)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010, 
> e=0x127ad60, calling_code=2) at UnixEThread.cc:128
> #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at 
> UnixEThread.cc:207
> #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918
> I believe it takes a wrong turn here --
> #9  0x0000000000783d40 in read_signal_and_update (event=105, 
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> 150         vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> (gdb) list
> 145     static inline int
> 146     read_signal_and_update(int event, UnixNetVConnection *vc)
> 147     {
> 148       vc->recursion++;
> 149       if (vc->read.vio._cont) {
> 150         vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> 151       } else {
> 152         switch (event) {
> 153         case VC_EVENT_EOS:
> 154         case VC_EVENT_ERROR:
> (gdb) list
> 155         case VC_EVENT_ACTIVE_TIMEOUT:
> 156         case VC_EVENT_INACTIVITY_TIMEOUT:
> 157           Debug("inactivity_cop", "event %d: null read.vio cont, closing 
> vc %p", event, vc);
> 158           vc->closed = 1;
> 159           break;
> 160         default:
> 161           Error("Unexpected event %d for vc %p", event, vc);
> 162           ink_release_assert(0);
> 163           break;
> 164         }
> Note: I understand that there were several issues related to TS-3196 
> concerning inactivity_cop and this section of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to