[
https://issues.apache.org/jira/browse/TS-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589184#comment-14589184
]
Susan Hinrichs edited comment on TS-3266 at 6/17/15 4:10 PM:
-------------------------------------------------------------
Uploaded a new version of ts-3266.diff
Tracked down the other mutex on the server-side read.vio: it is the mutex
associated with the server session pool.
The first attempt to communicate with the origin server pulled a session out of
the session pool. When a session is put into the pool, a do_io_read is set up
to catch anything that the server might send (e.g. EOS). That do_io_read is set
with the mutex from the session pool.
Sometimes you get very unlucky, and the EOS is sent just as a session is being
resurrected. In the original code, the do_io_read set when the session was
released to the pool was not replaced until the call to
HttpSM::attach_server_session. So the net_read_io called for the EOS would run
under the session pool mutex. In parallel, the attach_server_session logic
would continue and issue a new do_io_read under the HttpSM lock. Since the two
threads acquire different locks, they execute in parallel and overwrite the
same data structures.
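To make the interleaving concrete, here is a toy C++ sketch (not Traffic
Server code; NetVC, read_vio, pool_mutex, sm_mutex and the two functions are
all made-up stand-ins) of why two handlers guarded by different mutexes can
both rewrite the same read VIO at once:
{code}
#include <mutex>
#include <thread>

// Toy sketch only; it shows the essence of the race: two threads that each
// take a *different* mutex do not exclude each other, even though both
// rewrite the same read VIO.
struct VIO { void *cont = nullptr; };
struct NetVC { VIO read_vio; };

std::mutex pool_mutex;  // stands in for the session pool's ProxyMutex
std::mutex sm_mutex;    // stands in for the HttpSM's ProxyMutex
NetVC vc;               // the shared origin-server connection
int http_sm_dummy;      // stands in for the HttpSM continuation

// Thread A: net handler delivering the EOS to the pool's do_io_read.
void net_handler_eos() {
  std::lock_guard<std::mutex> g(pool_mutex);
  vc.read_vio.cont = nullptr;          // touches the shared read VIO
}

// Thread B: HttpSM::attach_server_session re-arming the read under its own lock.
void attach_server_session() {
  std::lock_guard<std::mutex> g(sm_mutex);
  vc.read_vio.cont = &http_sm_dummy;   // also touches the shared read VIO
}

int main() {
  // Both locks can be held at once, so the two writes to read_vio race.
  std::thread a(net_handler_eos);
  std::thread b(attach_server_session);
  a.join();
  b.join();
  return 0;
}
{code}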
The solution is to cancel the session pool do_io_read, under the session pool
lock, as the session is being reactivated. Then there is no window of
opportunity for a stray event to be caught by the session pool do_io_read in
parallel with activating the new session.
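A toy sketch of the shape of that fix, again with made-up names (SessionPool,
acquire, cancel_read); the actual change is in ts-3266.diff:
{code}
#include <mutex>

// Toy continuation of the sketch above, not the actual patch.
struct VIO { void *cont = nullptr; };

struct NetVC {
  VIO read_vio;
  // Stand-in for cancelling the pool's do_io_read: disarm the read VIO.
  void cancel_read() { read_vio.cont = nullptr; }
};

struct SessionPool {
  std::mutex mutex;          // the session pool's lock
  NetVC *idle_vc = nullptr;  // one pooled origin connection, for illustration

  NetVC *acquire() {
    std::lock_guard<std::mutex> g(mutex);
    NetVC *vc = idle_vc;
    idle_vc = nullptr;
    if (vc != nullptr) {
      // Cancel the pool's do_io_read *while still holding the pool lock*.
      // A late EOS has to take this same lock to be delivered, so by the time
      // it can run, the pool's VIO is already disarmed and nothing executes
      // in parallel with the HttpSM re-issuing do_io_read under its own mutex.
      vc->cancel_read();
    }
    return vc;
  }
};
{code}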
> core dump in UnixNetProcessor::connect_re_internal
> --------------------------------------------------
>
> Key: TS-3266
> URL: https://issues.apache.org/jira/browse/TS-3266
> Project: Traffic Server
> Issue Type: Bug
> Components: Core
> Affects Versions: 5.2.0, 5.3.0
> Reporter: Sudheer Vinukonda
> Assignee: Susan Hinrichs
> Labels: crash
> Attachments: ts-3266.diff
>
>
> See a new core dump in v5.2.0 after running stable for over 48 hours. Below
> is the bt and some gdb info.
> {code}
> (gdb) bt
> #0 0x0000000000773056 in EThread::is_event_type (this=0x0, et=2) at
> UnixEThread.cc:121
> #1 0x0000000000750cfe in UnixNetProcessor::connect_re_internal
> (this=0x1032fc0, cont=0x2aad4590d080, target=0x2aad4590d728,
> opt=0x2aac8b367600) at UnixNetProcessor.cc:247
> #2 0x000000000052b498 in NetProcessor::connect_re (this=0x1032fc0,
> cont=0x2aad4590d080, addr=0x2aad4590d728, opts=0x2aac8b367600) at
> ../iocore/net/P_UnixNetProcessor.h:85
> #3 0x00000000005e64e5 in HttpSM::do_http_server_open (this=0x2aad4590d080,
> raw=false) at HttpSM.cc:4796
> #4 0x00000000005edec7 in HttpSM::set_next_state (this=0x2aad4590d080) at
> HttpSM.cc:7141
> #5 0x00000000005ed2f2 in HttpSM::call_transact_and_set_next_state
> (this=0x2aad4590d080, f=0x607320
> <HttpTransact::HandleResponse(HttpTransact::State*)>) at HttpSM.cc:6961
> #6 0x00000000005e7b72 in HttpSM::handle_server_setup_error
> (this=0x2aad4590d080, event=104, data=0x2aade42edae8) at HttpSM.cc:5308
> #7 0x00000000005dc57c in HttpSM::state_send_server_request_header
> (this=0x2aad4590d080, event=104, data=0x2aade42edae8) at HttpSM.cc:1989
> #8 0x00000000005de6a2 in HttpSM::main_handler (this=0x2aad4590d080,
> event=104, data=0x2aade42edae8) at HttpSM.cc:2570
> #9 0x0000000000502eae in Continuation::handleEvent (this=0x2aad4590d080,
> event=104, data=0x2aade42edae8) at ../iocore/eventsystem/I_Continuation.h:146
> #10 0x00000000007524c3 in read_signal_and_update (event=104,
> vc=0x2aade42ed9d0) at UnixNetVConnection.cc:138
> #11 0x000000000075261e in read_signal_done (event=104, nh=0x2aac89a53ad0,
> vc=0x2aade42ed9d0) at UnixNetVConnection.cc:169
> #12 0x0000000000754cd4 in UnixNetVConnection::readSignalDone
> (this=0x2aade42ed9d0, event=104, nh=0x2aac89a53ad0) at
> UnixNetVConnection.cc:922
> #13 0x000000000073e088 in SSLNetVConnection::net_read_io
> (this=0x2aade42ed9d0, nh=0x2aac89a53ad0, lthread=0x2aac89a50010) at
> SSLNetVConnection.cc:596
> #14 0x000000000074c50d in NetHandler::mainNetEvent (this=0x2aac89a53ad0,
> event=5, e=0x282fb30) at UnixNet.cc:399
> #15 0x0000000000502eae in Continuation::handleEvent (this=0x2aac89a53ad0,
> event=5, data=0x282fb30) at ../iocore/eventsystem/I_Continuation.h:146
> #16 0x0000000000773172 in EThread::process_event (this=0x2aac89a50010,
> e=0x282fb30, calling_code=5) at UnixEThread.cc:144
> #17 0x000000000077367c in EThread::execute (this=0x2aac89a50010) at
> UnixEThread.cc:268
> #18 0x000000000077272d in spawn_thread_internal (a=0x2e1b740) at Thread.cc:88
> #19 0x00002aabd3d04851 in start_thread () from /lib64/libpthread.so.0
> #20 0x0000003296ee890d in clone () from /lib64/libc.so.6
> {code}
> {code}
> (gdb) frame 1
> #1 0x0000000000750cfe in UnixNetProcessor::connect_re_internal
> (this=0x1032fc0, cont=0x2aad4590d080, target=0x2aad4590d728,
> opt=0x2aac8b367600) at UnixNetProcessor.cc:247
> 247 UnixNetProcessor.cc: No such file or directory.
> in UnixNetProcessor.cc
> (gdb) print mutex
> $28 = (ProxyMutex *) 0x2aadf004d070
> (gdb) print *mutex
> $29 = {<RefCountObj> = {<ForceVFPTToTop> = {_vptr.ForceVFPTToTop = 0x77e890},
> m_refcount = 16}, the_mutex = {__data = {__lock = 0, __count = 0, __owner =
> 0, __nusers = 0, __kind = 0,
> __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000'
> <repeats 39 times>, __align = 0}, thread_holding = 0x0, nthread_holding = 0}
> (gdb) print t
> $30 = (EThread *) 0x0
> (gdb) print cont
> $31 = (Continuation *) 0x2aad4590d080
> (gdb) print *cont
> $32 = {<force_VFPT_to_top> = {_vptr.force_VFPT_to_top = 0x7aaef0}, handler =
> (int (Continuation::*)(Continuation *, int, void *)) 0x5de4ce
> <HttpSM::main_handler(int, void*)>, mutex = {
> m_ptr = 0x2aadf004d070}, link = {<SLink<Continuation>> = {next = 0x0},
> prev = 0x0}}
> {code}