NettyTransceiver can hang after server restart
----------------------------------------------

                 Key: AVRO-1013
                 URL: https://issues.apache.org/jira/browse/AVRO-1013
             Project: Avro
          Issue Type: Bug
    Affects Versions: 1.6.1
            Reporter: James Baldassari
            Priority: Blocker


I ran into a very specific scenario today which can lead to NettyTransceiver 
hanging indefinitely:

# Start up a NettyServer
# Initialize a NettyTransceiver and SpecificRequestor
# Execute an RPC to establish the connection/handshake with the server
# Shut down the server
# Immediately execute another RPC

After Step 4, NettyTransceiver will detect that the connection has been closed 
and call NettyTransceiver#disconnect(boolean, boolean, Throwable), which sets 
'remote' to null, indicating to Requestor that the NettyTransceiver is now 
disconnected.  However, if an RPC is executed just after the server has closed 
its socket (Step 5) and before disconnect() has been called, NettyTransceiver 
may still try to send this RPC because 'remote' has not yet been set to null.  
This race condition is normally ok because NettyTransceiver#getChannel() will 
detect that the socket has been closed and then try to reestablish the 
connection.  Unfortunately, in this scenario getChannel() blocks forever when 
it attempts to acquire the write lock because the read lock has been acquired 
twice rather than once as getChannel() expects.  The read lock is acquired once 
by transceive(List<ByteBuffer>, Callback<List<ByteBuffer>>) and again by 
writeDataPack(NettyDataPack).
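
The locking behavior behind the hang can be reproduced without Avro or Netty. The following is a minimal sketch, assuming the lock in question behaves like java.util.concurrent.locks.ReentrantReadWriteLock; the class and method names are hypothetical and are not taken from NettyTransceiver. It releases the read lock once, as a caller expecting a single hold would, and then tries to take the write lock with a timeout so that the demo does not itself hang:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; it only illustrates the lock semantics, not Avro code.
class ReadLockUpgradeDemo {

    /**
     * Acquires the read lock readHolds times (readHolds must be >= 1), releases
     * it once, then tries to take the write lock on the same thread.
     * Returns true if the write lock could be acquired within the timeout.
     */
    static boolean canTakeWriteLock(int readHolds) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        for (int i = 0; i < readHolds; i++) {
            lock.readLock().lock();
        }
        // Release once, as code that assumes a single read hold would do.
        lock.readLock().unlock();

        boolean acquired = false;
        try {
            // A plain lock() call here would block forever if a read hold remains.
            acquired = lock.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (acquired) {
            lock.writeLock().unlock();
        }
        // Release any read holds that are still outstanding.
        for (int i = 1; i < readHolds; i++) {
            lock.readLock().unlock();
        }
        return acquired;
    }

    public static void main(String[] args) {
        System.out.println("one read hold:  " + canTakeWriteLock(1)); // true
        System.out.println("two read holds: " + canTakeWriteLock(2)); // false
    }
}
```

With one read hold the write lock is obtained immediately; with two, the remaining hold keeps the write lock unavailable, which with an untimed lock() call would mean waiting indefinitely.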

The fix is fairly simple.  With the fix, the writeDataPack(NettyDataPack) method 
(which is private) no longer acquires the read lock itself; instead, its contract 
specifies that the read lock must be acquired before calling this method.  This 
change prevents the read lock from being acquired more than once by any single 
thread.  Another 
change is to have NettyTransceiver#isConnected() perform two checks instead of 
one: remote != null && isChannelReady(channel).  This second change should 
allow NettyTransceiver to detect disconnect events more quickly.
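
The shape of both changes can be sketched in a simplified model. This is not the actual NettyTransceiver source: FakeChannel, stateLock, simulateServerClose() and the counters are hypothetical stand-ins, and the way getChannel() swaps the read lock for the write lock is an assumption based on the description above.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified model of the locking pattern; not Avro code.
class SketchTransceiver {

    /** Stand-in for a Netty channel; it only tracks whether it is open. */
    static final class FakeChannel {
        volatile boolean open = true;
    }

    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
    private FakeChannel channel = new FakeChannel();
    private Object remote = new Object(); // non-null once the handshake is done
    int reconnects = 0;
    int sent = 0;

    private static boolean isChannelReady(FakeChannel c) {
        return c != null && c.open;
    }

    /** Two checks instead of one: handshake state and channel state. */
    boolean isConnected() {
        stateLock.readLock().lock();
        try {
            return remote != null && isChannelReady(channel);
        } finally {
            stateLock.readLock().unlock();
        }
    }

    /** Simulates the server closing its socket before disconnect() has run. */
    void simulateServerClose() {
        channel.open = false;
    }

    /** Acquires the read lock exactly once, then delegates. */
    void transceive(String request) {
        stateLock.readLock().lock();
        try {
            writeDataPack(request);
        } finally {
            stateLock.readLock().unlock();
        }
    }

    /** Contract: the caller must already hold the read lock; it is not re-acquired here. */
    private void writeDataPack(String dataPack) {
        if (getChannel().open) {
            sent++;
        }
    }

    /** Contract: the caller holds the read lock exactly once; it is still held on return. */
    private FakeChannel getChannel() {
        if (!isChannelReady(channel)) {
            stateLock.readLock().unlock();  // give up the single read hold
            stateLock.writeLock().lock();   // would block if a second read hold remained
            try {
                if (!isChannelReady(channel)) {
                    channel = new FakeChannel(); // stands in for reconnecting
                    reconnects++;
                }
            } finally {
                stateLock.readLock().lock();    // downgrade: re-take read, then drop write
                stateLock.writeLock().unlock();
            }
        }
        return channel;
    }
}
```

In this model, calling simulateServerClose() followed by transceive("ping") reconnects once and returns instead of hanging, because only transceive() holds the read lock when getChannel() runs.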
