[ 
https://issues.apache.org/jira/browse/TS-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588448#comment-14588448
 ] 

Susan Hinrichs commented on TS-3266:
------------------------------------

I think this is caused by using different locks at different points to 
protect the UnixNetVConnection::read.vio data structure.

In the crash in SSLNetVConnection::net_read_io (frame #13), the function has 
successfully acquired the lock that this->read.vio.mutex was pointing at at 
the time.  Judging by the mutex pointer values in the variable lock, this 
corresponds to the value of nh->mutex.  At the point of the crash, 
this->read.vio.mutex has the same value as the HttpSM->mutex in frame 6.

I'm guessing that another thread has acquired the HttpSM lock and updated the 
read.vio data, since almost everywhere the read.vio seems to be protected by 
the HttpSM mutex, not the NetHandler mutex.
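
To make the suspected hazard concrete, here is a minimal single-threaded 
sketch.  It is not ATS code: Vio, ReadResult and guarded_read are hypothetical 
names, std::mutex stands in for ProxyMutex, and the callback stands in for 
whatever another thread does while holding a different lock.  The reader locks 
whatever mutex the VIO names at that moment, the VIO is then re-pointed, and 
the reader ends up holding a lock that no longer guards the data.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a VIO that names the mutex guarding it.
struct Vio {
    std::shared_ptr<std::mutex> mutex;
};

enum class ReadResult { Ok, StaleLock };

// Snapshot the mutex the VIO currently names, lock it, then let a simulated
// "other thread" step run. Afterwards, check whether the VIO still names the
// lock we hold. If it does not, the lock we hold no longer protects the VIO.
template <typename OtherThreadStep>
ReadResult guarded_read(Vio& vio, OtherThreadStep other_thread_step) {
    std::shared_ptr<std::mutex> held = vio.mutex;
    assert(held && "the VIO must name a mutex");
    std::lock_guard<std::mutex> lock(*held);

    // Stands in for a writer that updates the VIO under a different lock.
    other_thread_step();

    return vio.mutex == held ? ReadResult::Ok : ReadResult::StaleLock;
}
```

If the callback re-points vio.mutex from one mutex (playing the role of 
nh->mutex) to another (playing the role of the HttpSM mutex), guarded_read 
returns ReadResult::StaleLock, which is the state the core appears to show.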

Looking more broadly at the state of the cores, it appears that ATS has 
completed an SSL handshake with the origin server.  Then the origin server sends 
an EOS, which is the event being processed at the time of the core.

This is the second attempt to connect to the server 
(HttpSM::t_state.current.attempts == 2).  Presumably the first attempt failed, 
and we are now launching into our third attempt (or perhaps we are launching 
into our second attempt).  This explains the server_entry == NULL that Sudheer 
noticed.  The server_entry is set to NULL as we go up the stack to clean up the 
failed connection attempt.

At first I blamed the cases where we pass NULL into the continuation for the 
do_io_read, but I no longer think that is the case.  If the continuation is 
NULL, the netVC->mutex is used.  And at this point, the netVC->mutex looks like 
the HttpSM mutex, not the net handler mutex.  I cannot find the scenario where 
the net handler's mutex is ever used in setting up a read vio.  Still looking.
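
The selection rule described above can be sketched as follows.  This is an 
illustration under stated assumptions, not the actual do_io_read 
implementation: Continuation, NetVC, read_vio_mutex and setup_read_vio are 
simplified, hypothetical stand-ins, and std::mutex replaces ProxyMutex.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

using MutexPtr = std::shared_ptr<std::mutex>;

// Hypothetical, simplified stand-ins for the real classes.
struct Continuation {
    MutexPtr mutex;
};

struct NetVC {
    MutexPtr mutex;           // the VC's own mutex
    MutexPtr read_vio_mutex;  // the mutex the read VIO will name
};

// If a continuation is supplied, the read VIO takes the continuation's
// mutex; with a NULL continuation it falls back to the VC's own mutex.
// Neither branch consults a net handler mutex.
void setup_read_vio(NetVC& vc, const Continuation* cont) {
    assert(vc.mutex && "the VC must have a mutex to fall back on");
    vc.read_vio_mutex = cont ? cont->mutex : vc.mutex;
}
```

Under this rule neither branch yields the net handler's mutex unless the 
continuation or the VC already holds it, which is why the NULL continuation 
case alone does not explain the nh->mutex value seen in frame #13.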

> core dump in UnixNetProcessor::connect_re_internal
> --------------------------------------------------
>
>                 Key: TS-3266
>                 URL: https://issues.apache.org/jira/browse/TS-3266
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 5.2.0, 5.3.0
>            Reporter: Sudheer Vinukonda
>            Assignee: Susan Hinrichs
>              Labels: crash
>         Attachments: ts-3266.diff
>
>
> See a new core dump in v5.2.0 after running stable for over 48 hours. Below 
> is the bt and some gdb info.
> {code}
> (gdb) bt
> #0  0x0000000000773056 in EThread::is_event_type (this=0x0, et=2) at 
> UnixEThread.cc:121
> #1  0x0000000000750cfe in UnixNetProcessor::connect_re_internal 
> (this=0x1032fc0, cont=0x2aad4590d080, target=0x2aad4590d728, 
> opt=0x2aac8b367600) at UnixNetProcessor.cc:247
> #2  0x000000000052b498 in NetProcessor::connect_re (this=0x1032fc0, 
> cont=0x2aad4590d080, addr=0x2aad4590d728, opts=0x2aac8b367600) at 
> ../iocore/net/P_UnixNetProcessor.h:85
> #3  0x00000000005e64e5 in HttpSM::do_http_server_open (this=0x2aad4590d080, 
> raw=false) at HttpSM.cc:4796
> #4  0x00000000005edec7 in HttpSM::set_next_state (this=0x2aad4590d080) at 
> HttpSM.cc:7141
> #5  0x00000000005ed2f2 in HttpSM::call_transact_and_set_next_state 
> (this=0x2aad4590d080, f=0x607320 
> <HttpTransact::HandleResponse(HttpTransact::State*)>) at HttpSM.cc:6961
> #6  0x00000000005e7b72 in HttpSM::handle_server_setup_error 
> (this=0x2aad4590d080, event=104, data=0x2aade42edae8) at HttpSM.cc:5308
> #7  0x00000000005dc57c in HttpSM::state_send_server_request_header 
> (this=0x2aad4590d080, event=104, data=0x2aade42edae8) at HttpSM.cc:1989
> #8  0x00000000005de6a2 in HttpSM::main_handler (this=0x2aad4590d080, 
> event=104, data=0x2aade42edae8) at HttpSM.cc:2570
> #9  0x0000000000502eae in Continuation::handleEvent (this=0x2aad4590d080, 
> event=104, data=0x2aade42edae8) at ../iocore/eventsystem/I_Continuation.h:146
> #10 0x00000000007524c3 in read_signal_and_update (event=104, 
> vc=0x2aade42ed9d0) at UnixNetVConnection.cc:138
> #11 0x000000000075261e in read_signal_done (event=104, nh=0x2aac89a53ad0, 
> vc=0x2aade42ed9d0) at UnixNetVConnection.cc:169
> #12 0x0000000000754cd4 in UnixNetVConnection::readSignalDone 
> (this=0x2aade42ed9d0, event=104, nh=0x2aac89a53ad0) at 
> UnixNetVConnection.cc:922
> #13 0x000000000073e088 in SSLNetVConnection::net_read_io 
> (this=0x2aade42ed9d0, nh=0x2aac89a53ad0, lthread=0x2aac89a50010) at 
> SSLNetVConnection.cc:596
> #14 0x000000000074c50d in NetHandler::mainNetEvent (this=0x2aac89a53ad0, 
> event=5, e=0x282fb30) at UnixNet.cc:399
> #15 0x0000000000502eae in Continuation::handleEvent (this=0x2aac89a53ad0, 
> event=5, data=0x282fb30) at ../iocore/eventsystem/I_Continuation.h:146
> #16 0x0000000000773172 in EThread::process_event (this=0x2aac89a50010, 
> e=0x282fb30, calling_code=5) at UnixEThread.cc:144
> #17 0x000000000077367c in EThread::execute (this=0x2aac89a50010) at 
> UnixEThread.cc:268
> #18 0x000000000077272d in spawn_thread_internal (a=0x2e1b740) at Thread.cc:88
> #19 0x00002aabd3d04851 in start_thread () from /lib64/libpthread.so.0
> #20 0x0000003296ee890d in clone () from /lib64/libc.so.6
> {code}
> {code}
> (gdb) frame 1
> #1  0x0000000000750cfe in UnixNetProcessor::connect_re_internal 
> (this=0x1032fc0, cont=0x2aad4590d080, target=0x2aad4590d728, 
> opt=0x2aac8b367600) at UnixNetProcessor.cc:247
> 247   UnixNetProcessor.cc: No such file or directory.
>       in UnixNetProcessor.cc
> (gdb) print mutex
> $28 = (ProxyMutex *) 0x2aadf004d070
> (gdb) print *mutex
> $29 = {<RefCountObj> = {<ForceVFPTToTop> = {_vptr.ForceVFPTToTop = 0x77e890}, 
> m_refcount = 16}, the_mutex = {__data = {__lock = 0, __count = 0, __owner = 
> 0, __nusers = 0, __kind = 0, 
>       __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' 
> <repeats 39 times>, __align = 0}, thread_holding = 0x0, nthread_holding = 0}
> (gdb) print t
> $30 = (EThread *) 0x0
> (gdb) print cont
> $31 = (Continuation *) 0x2aad4590d080
> (gdb) print *cont
> $32 = {<force_VFPT_to_top> = {_vptr.force_VFPT_to_top = 0x7aaef0}, handler = 
> (int (Continuation::*)(Continuation *, int, void *)) 0x5de4ce 
> <HttpSM::main_handler(int, void*)>, mutex = {
>     m_ptr = 0x2aadf004d070}, link = {<SLink<Continuation>> = {next = 0x0}, 
> prev = 0x0}}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
