> On Nov. 19, 2013, 8:08 p.m., Andrew Stitcher wrote:
> > I don't think the locks can deadlock, so from that point of view they are 
> > ok theoretically.
> > 
> > But there are some very important open questions in my mind:
> > 1. Why there are multiple callers of close and is there a good reason for 
> > it.
> > 2. Why don't we see crashes under Linux too, and does this fix work there?
> > 3. Also this looks like cut'n'paste programming from code elsewhere in the 
> > tree - are you sure that the original doesn't also have the bug (and if not 
> > why not?)

1. I can see two hypothetical reasons for the problem. The first is that the 
disconnected callback is called by the windows IO layer when the remote peer - 
i.e. the broker - closes the socket, whereas on linux the eof callback is 
called. If, as seems to be the case for linux, a disconnected callback is 
always followed by a closed callback, then socketClosed(), and therefore 
queueForDeletion() would be called twice. The second possibility is 
disconnection by the remote peer concurrent with a close request from the 
application. Certainly the code as it is now (prior to this patch) has a race 
condition in it that doesn't handle that concurrency adequately. The patch here 
addresses both of these hypothetical problems. Some extra tracing could shed 
some light in which (if either) of these is happening.

2. The fix does work on linux. My guess as to why we don't see the same crashes 
is due to differences in the underlying AsynchIO implementations. Perhaps its 
as simple as timing related. It does seem that on linux the disconnected event 
isn't called, at least in the common cases. Perhaps the posix implementation is 
more tolerant of double calls to queueForDeletion()? 

3. The code here is largely copied from qpid::client::TcpConnector, differing 
mainly in the encode and decode. There is some other locking in that class that 
would I think prevent the race. The same concerns around the disconnected 
callback handler apply, but I'm wondering whether that is only ever actually 
called in response to abort()?.


- Gordon


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15677/#review29132
-----------------------------------------------------------


On Nov. 19, 2013, 7:53 p.m., Chug Rolke wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15677/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2013, 7:53 p.m.)
> 
> 
> Review request for qpid, Andrew Stitcher and Steve Huston.
> 
> 
> Bugs: QPID-5363
>     https://issues.apache.org/jira/browse/QPID-5363
> 
> 
> Repository: qpid
> 
> 
> Description
> -------
> 
> Windows fails with an access violation about 10% of the time while closing 
> AMQP 1.0 connections:
> 
> STACK
> =====
> >        qpidmessagingd.dll!qpid::messaging::amqp::TcpTransport::close()  
> > Line 123 + 0x10 bytes        C++
>          qpidmessagingd.dll!qpid::messaging::amqp::ConnectionContext::close() 
>  Line 134 + 0x29 bytes        C++
>          qpidmessagingd.dll!qpid::messaging::amqp::ConnectionHandle::close()  
> Line 62        C++
>          qpidmessagingd.dll!qpid::messaging::Connection::close()  Line 78 + 
> 0x24 bytes        C++
>          hello_world.exe!main(int argc=4, char * * argv=0x0053a688)  Line 51 
> + 0xe bytes        C++
>          hello_world.exe!__tmainCRTStartup()  Line 586 + 0x19 bytes        C
>          hello_world.exe!mainCRTStartup()  Line 403        C
> 
> CODE
> ====
> void TcpTransport::close()
> {
>     QPID_LOG(debug, id << " TcpTransport closing...");
>     if (aio)
>         aio->queueWriteClose();    <=======
> }
> 
> FAILURE
> =======
> aio's vftable is all 0xdddddddd, indicating that it has been deleted.
> 
> THE FIX
> =======
> Use some locks to protect closing the aio object.
> 
> DISCUSSION
> ==========
> The locks in the diff appear to work ok in that the code passes thousands of 
> runs. Are they OK theoretically?
> 
> Include in 0.26 release
> -----------------------
> Please indicate your approval or not in QPID-5363.
> 
> 
> Diffs
> -----
> 
>   trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.h 1543532 
>   trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.cpp 1543532 
> 
> Diff: https://reviews.apache.org/r/15677/diff/
> 
> 
> Testing
> -------
> 
> Running HelloWorld in a loop in multiple windows works ok.
> 
> Passes unit tests
> 
> 
> Thanks,
> 
> Chug Rolke
> 
>

Reply via email to