[ 
https://issues.apache.org/jira/browse/TS-4916?focusedWorklogId=30697&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-30697
 ]

ASF GitHub Bot logged work on TS-4916:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Oct/16 22:15
            Start Date: 14/Oct/16 22:15
    Worklog Time Spent: 10m 
      Work Description: Github user gtenev commented on a diff in the pull 
request:

    https://github.com/apache/trafficserver/pull/1100#discussion_r83510837
  
    --- Diff: proxy/http2/Http2ConnectionState.cc ---
    @@ -936,30 +940,70 @@ Http2ConnectionState::cleanup_streams()
     void
     Http2ConnectionState::delete_stream(Http2Stream *stream)
     {
    +  // The following check allows the method to be called safely on already deleted streams.
    +  if (deleted_from_active_streams(stream)) {
    +    return;
    +  }
    +
    +  SCOPED_MUTEX_LOCK(lock, this->mutex, this_ethread());
    +
    --- End diff --
    
    If we are sure `DLL<>` is always protected by a lock, then I must have 
really misunderstood this previous [comment on 
TS-4916](https://issues.apache.org/jira/browse/TS-4916?focusedCommentId=15552505&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15552505),
 where we suspected `DLL<>` was being “manipulated by simultaneous threads”.
    
    That would mean that at least one thread was not holding the right lock. 
In that case, wouldn't “rearranging some of the stream_count book keeping” 
hide the problem rather than fix it?
    
    Trying to help, and based on that comment, I decided to trace a few paths 
in the source code that might (theoretically) not be holding the 
ConnectionState lock, and grabbed the lock on the common path closest to the 
structures that needed protection (which, based on my understanding, should 
not be a problem if the thread is already holding the lock).
    
    Actually, I never noticed the race condition in my debugging, so I am 
going to remove this line from the PR, and I will consider limiting my future 
changes to the issue I am trying to fix.
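
For readers following the locking argument above, here is a minimal sketch of the two ideas in the diff: a delete that is safe to call on an already deleted stream, and a lock that the same thread can take again without blocking. It is an illustration only. `ConnectionState`, `add_stream`, and `stream_count` here are hypothetical stand-ins, and `std::recursive_mutex` is used to model the assumption that the lock is re-entrant; none of this is the Traffic Server implementation. Unlike the diff, this sketch performs the membership check after taking the lock.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the connection state; NOT the Traffic Server class.
class ConnectionState
{
public:
  void
  add_stream(int id)
  {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    streams_.push_back(id);
  }

  // Returns true if the stream was removed, false if it was not in the
  // active list (for example because it had already been deleted).
  bool
  delete_stream(int id)
  {
    // Re-entrant: does not block if this thread already holds the lock.
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    auto it = std::find(streams_.begin(), streams_.end(), id);
    if (it == streams_.end()) {
      return false; // already deleted, nothing to unlink
    }
    streams_.erase(it);
    return true;
  }

  // A caller that already holds the lock and then calls delete_stream().
  std::size_t
  cleanup_streams()
  {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    std::size_t removed = 0;
    // Iterate over a copy because delete_stream() mutates streams_.
    const std::vector<int> snapshot = streams_;
    for (int id : snapshot) {
      if (delete_stream(id)) {
        ++removed;
      }
    }
    assert(streams_.empty());
    return removed;
  }

  std::size_t
  stream_count() const
  {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    return streams_.size();
  }

private:
  mutable std::recursive_mutex mutex_;
  std::vector<int> streams_;
};
```

With a non-recursive mutex, the nested call from `cleanup_streams()` into `delete_stream()` would be an error, so the "grab the lock on the common path" approach only works under the re-entrancy assumption stated above.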



Issue Time Tracking
-------------------

    Worklog Id:     (was: 30697)
    Time Spent: 4.5h  (was: 4h 20m)

> Http2ConnectionState::restart_streams infinite loop causes deadlock 
> --------------------------------------------------------------------
>
>                 Key: TS-4916
>                 URL: https://issues.apache.org/jira/browse/TS-4916
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core, HTTP/2
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>            Priority: Blocker
>             Fix For: 7.1.0
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Http2ConnectionState::restart_streams falls into an infinite loop while 
> holding a lock, which causes cache updates to start failing.
> The infinite loop is caused by traversing a list whose last element's “next” 
> points to the element itself, so the traversal never finishes.
> {code}
> Thread 51 (Thread 0x2aaab3d04700 (LWP 34270)):
> #0  0x00002aaaaacf3fee in Http2ConnectionState::restart_streams 
> (this=0x2ae6ba5284c8) at Http2ConnectionState.cc:913
> #1  rcv_window_update_frame (cstate=..., frame=...) at 
> Http2ConnectionState.cc:627
> #2  0x00002aaaaacf9738 in Http2ConnectionState::main_event_handler 
> (this=0x2ae6ba5284c8, event=<optimized out>, edata=<optimized out>) at 
> Http2ConnectionState.cc:823
> #3  0x00002aaaaacef1c3 in Continuation::handleEvent (data=0x2aaab3d039a0, 
> event=2253, this=0x2ae6ba5284c8) at 
> ../../iocore/eventsystem/I_Continuation.h:153
> #4  send_connection_event (cont=cont@entry=0x2ae6ba5284c8, 
> event=event@entry=2253, edata=edata@entry=0x2aaab3d039a0) at 
> Http2ClientSession.cc:58
> #5  0x00002aaaaacef462 in Http2ClientSession::state_complete_frame_read 
> (this=0x2ae6ba528290, event=<optimized out>, edata=0x2aab7b237f18) at 
> Http2ClientSession.cc:426
> #6  0x00002aaaaacf0982 in Continuation::handleEvent (data=0x2aab7b237f18, 
> event=100, this=0x2ae6ba528290) at 
> ../../iocore/eventsystem/I_Continuation.h:153
> #7  Http2ClientSession::state_start_frame_read (this=0x2ae6ba528290, 
> event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:399
> #8  0x00002aaaaacef5a3 in Continuation::handleEvent (data=0x2aab7b237f18, 
> event=100, this=0x2ae6ba528290) at 
> ../../iocore/eventsystem/I_Continuation.h:153
> #9  Http2ClientSession::state_complete_frame_read (this=0x2ae6ba528290, 
> event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:431
> #10 0x00002aaaaacf0982 in Continuation::handleEvent (data=0x2aab7b237f18, 
> event=100, this=0x2ae6ba528290) at 
> ../../iocore/eventsystem/I_Continuation.h:153
> #11 Http2ClientSession::state_start_frame_read (this=0x2ae6ba528290, 
> event=<optimized out>, edata=0x2aab7b237f18) at Http2ClientSession.cc:399
> #12 0x00002aaaaae67e2b in Continuation::handleEvent (data=0x2aab7b237f18, 
> event=100, this=<optimized out>) at 
> ../../iocore/eventsystem/I_Continuation.h:153
> #13 read_signal_and_update (vc=0x2aab7b237e00, vc@entry=0x1, 
> event=event@entry=100) at UnixNetVConnection.cc:153
> #14 UnixNetVConnection::readSignalAndUpdate (this=this@entry=0x2aab7b237e00, 
> event=event@entry=100) at UnixNetVConnection.cc:1036
> #15 0x00002aaaaae47653 in SSLNetVConnection::net_read_io 
> (this=0x2aab7b237e00, nh=0x2aaab2409cc0, lthread=0x2aaab2406000) at 
> SSLNetVConnection.cc:595
> #16 0x00002aaaaae5558c in NetHandler::mainNetEvent (this=0x2aaab2409cc0, 
> event=<optimized out>, e=<optimized out>) at UnixNet.cc:513
> #17 0x00002aaaaae8d2e6 in Continuation::handleEvent (data=0x2aaab0bfa700, 
> event=5, this=<optimized out>) at I_Continuation.h:153
> #18 EThread::process_event (calling_code=5, e=0x2aaab0bfa700, 
> this=0x2aaab2406000) at UnixEThread.cc:148
> #19 EThread::execute (this=0x2aaab2406000) at UnixEThread.cc:275
> #20 0x00002aaaaae8c0e6 in spawn_thread_internal (a=0x2aaab0b25bb0) at 
> Thread.cc:86
> #21 0x00002aaaad6b3aa1 in start_thread (arg=0x2aaab3d04700) at 
> pthread_create.c:301
> #22 0x00002aaaae8bc93d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
> {code}
> Here is the stream_list trace.
> {code}
> (gdb) thread 51
> [Switching to thread 51 (Thread 0x2aaab3d04700 (LWP 34270))]
> #0  0x00002aaaaacf3fee in Http2ConnectionState::restart_streams 
> (this=0x2ae6ba5284c8) at Http2ConnectionState.cc:913
> (gdb) trace_list stream_list
> ------- count=0 -------
> id=29
> this=0x2ae673f0c840
> next=0x2aaac05d8900
> prev=(nil)
> ------- count=1 -------
> id=27
> this=0x2aaac05d8900
> next=0x2ae5b6bbec00
> prev=0x2ae673f0c840
> ------- count=2 -------
> id=19
> this=0x2ae5b6bbec00
> next=0x2ae5b6bbec00
> prev=0x2aaac05d8900
> ------- count=3 -------
> id=19
> this=0x2ae5b6bbec00
> next=0x2ae5b6bbec00
> prev=0x2aaac05d8900
> . . . 
> ------- count=5560 -------
> id=19
> this=0x2ae5b6bbec00
> next=0x2ae5b6bbec00
> prev=0x2aaac05d8900
> . . .
> {code}
> Currently I am working on finding out why the list in question got into this 
> “impossible” (broken) state and eventually coming up with a fix.
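
The broken state shown in the trace, where the last element's next pointer refers to the element itself, can be reproduced and detected with a small sketch. This is an illustration under assumptions: `Node`, `link`, and `has_cycle` are hypothetical names, not the `DLL<>` template or any Traffic Server code, and the sketch does not claim to show how the real list became corrupted. It only shows why a plain `while (node) node = node->next;` walk never ends on such a list, and how Floyd's tortoise-and-hare check would notice the loop.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a stream linked into an intrusive list.
struct Node {
  explicit Node(int i) : id(i) {}
  int id;
  Node *next = nullptr;
  Node *prev = nullptr;
};

// Links b directly after a. Linking a node to itself is exactly the
// corruption seen in the trace, so it is rejected here.
void
link(Node &a, Node &b)
{
  assert(&a != &b);
  a.next = &b;
  b.prev = &a;
}

// Floyd's cycle detection over the next pointers: the fast pointer advances
// two nodes per step and the slow pointer one. They can only meet again if
// the list loops back on itself.
bool
has_cycle(const Node *head)
{
  const Node *slow = head;
  const Node *fast = head;
  while (fast != nullptr && fast->next != nullptr) {
    slow = slow->next;
    fast = fast->next->next;
    if (slow == fast) {
      return true;
    }
  }
  return false;
}
```

For example, linking three nodes with ids 29, 27, and 19 in that order gives a list for which `has_cycle` returns false. Setting the last node's `next` to point at itself, as in the trace, makes it return true.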



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
