[ https://issues.apache.org/jira/browse/AVRO-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660028#comment-13660028 ]

Doug Cutting commented on AVRO-1292:
------------------------------------

James, is this ready to commit?  What kind of testing have you done?  Is it 
possible to add tests for this?
                
> NettyTransceiver: Client threads can block under certain connection failure 
> scenarios
> -------------------------------------------------------------------------------------
>
>                 Key: AVRO-1292
>                 URL: https://issues.apache.org/jira/browse/AVRO-1292
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.4
>            Reporter: James Baldassari
>            Assignee: James Baldassari
>              Labels: avro, ipc, netty
>         Attachments: AVRO-1292-Part1.patch, AVRO-1292-Part2.patch, 
> AVRO-1292-Part2-v2.patch
>
>
> I've recently found a couple of different failure scenarios with 
> NettyTransceiver that result in:
> * Client threads blocking for long periods of time (uninterruptibly at that) 
> while holding the {{stateLock}} write lock
> * RPCs (either sync or async) never returning because a failure in sending 
> the RPC was not propagated back up to the caller
> The patch I'm going to submit will probably be a lot easier to understand, 
> but I'll try to explain the main problems I found.  There is a single type of 
> underlying connectivity issue that seems to trigger both of these problems in 
> NettyTransceiver: a failure at the network layer causes all packets to be 
> dropped somewhere between the RPC client and server.  You might think this 
> would be a rare scenario, but it has happened several times in our production 
> environment and usually occurs after the RPC server machine becomes 
> unresponsive and needs to be physically rebooted.  The only way I've been 
> able to reproduce this scenario for testing purposes has been to set up an 
> iptables rule on the RPC server that simply drops all incoming packets from 
> the client.  For example, if the client's IP is 10.0.0.1 I would use the 
> following iptables rule on the server to reproduce the failure:
> {code}
> iptables -t mangle -A INPUT --source 10.0.0.1 -j DROP
> {code}
> After looking through a lot of stack traces, I think I've identified two main 
> problems:
> *Problem 1:* NettyTransceiver calls 
> {{ChannelFuture#awaitUninterruptibly(long)}} in a couple of places: 
> {{getChannel()}} and {{disconnect(boolean,boolean,Throwable)}}.  Under the 
> dropped-packet scenario outlined above, the client thread ends up blocking 
> uninterruptibly for the entire connection timeout while holding the 
> {{stateLock}} write lock.  The stack trace for this situation looks like this:
> {code}
> "RPC Executor - 11 - 1363627762930" daemon prio=10 tid=0x00002aaad005f000 nid=0x56cf in Object.wait() [0x0000000049344000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:443)
>         at org.jboss.netty.channel.DefaultChannelFuture.await0(DefaultChannelFuture.java:265)
>         - locked <0x0000000703acfa00> (a org.jboss.netty.channel.DefaultChannelFuture)
>         at org.jboss.netty.channel.DefaultChannelFuture.awaitUninterruptibly(DefaultChannelFuture.java:237)
>         at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:248)
>         at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:199)
>         at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:148)
> {code}
> At a minimum it should be possible to interrupt these connection attempts.
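For illustration only, here is a minimal sketch of the difference an interruptible wait makes. Netty 3's ChannelFuture offers await(long), which throws InterruptedException, alongside awaitUninterruptibly(long). To keep the example runnable without Netty on the classpath, a CountDownLatch stands in for the connect future. The class and method names are hypothetical, and this is not necessarily the approach taken in the attached patches.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a CountDownLatch stands in for a connect future that never completes.
final class InterruptibleConnectSketch {

    /** Timed wait that an interrupt can end early (analogous to an interruptible await). */
    static boolean awaitConnection(CountDownLatch connected, long timeoutMillis)
            throws InterruptedException {
        return connected.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Returns true if interrupting the client thread ended its wait long before the timeout. */
    static boolean interruptEndsWaitEarly() {
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        Thread client = new Thread(() -> {
            try {
                // The latch is never counted down, so only the timeout or an interrupt ends this.
                awaitConnection(new CountDownLatch(1), 60_000L);
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
            }
        });
        client.start();
        client.interrupt();
        try {
            client.join(5_000L); // far shorter than the 60 s "connection timeout"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return sawInterrupt.get() && !client.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("interrupt ended wait early: " + interruptEndsWaitEarly());
    }
}
```

With an uninterruptible wait in the same position, the client thread would stay parked for the full timeout no matter how often it was interrupted, and any lock it held would stay held for that long too.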
> *Problem 2:* When an error occurs writing to the Netty channel the error is 
> not passed back up the stack or callback chain (whether it's a sync or async 
> RPC), so the client can end up waiting indefinitely for an RPC that will 
> never return because an error occurred sending the Netty packet (i.e. it was 
> never sent to the server in the first place).  This scenario might yield a 
> stack trace like the following:
> {code}
> "main" prio=10 tid=0x00007f9400008800 nid=0x379b waiting on condition [0x00007f9406bc6000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000007af677960> (a java.util.concurrent.CountDownLatch$Sync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
>         at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
>         at org.apache.avro.ipc.CallFuture.await(CallFuture.java:141)
>         at org.apache.avro.ipc.Requestor.request(Requestor.java:150)
>         at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
>         at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
>         at $Proxy9.send(Unknown Source)
> {code}
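As a rough sketch of what propagating the write failure could look like: the caller in the trace above is parked on a latch that only a response would release, so a failed write needs to release it with an error instead. In a real fix this would presumably hook a ChannelFutureListener onto the Netty write future and forward the cause to the pending callback. The version below uses only java.util.concurrent so it runs standalone. PendingCall, WriteListener and failingWrite are hypothetical names, not Avro or Netty APIs.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: forward a failed write to the caller that is waiting on the RPC.
final class WriteFailurePropagationSketch {

    /** Minimal latch-based future: completes with either a result or an error. */
    static final class PendingCall {
        private final CountDownLatch done = new CountDownLatch(1);
        private volatile String result;
        private volatile Throwable error;

        void handleResult(String r) { result = r; done.countDown(); }
        void handleError(Throwable t) { error = t; done.countDown(); }

        String get(long timeoutMillis) throws Exception {
            if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("no response and no error reported");
            }
            if (error != null) {
                throw new IOException("RPC failed", error);
            }
            return result;
        }
    }

    /** Stand-in for a write-completion listener; null means the write succeeded. */
    interface WriteListener {
        void writeCompleted(Throwable failureOrNull);
    }

    /** Simulated transport whose write always fails, as if the packet was never sent. */
    static void failingWrite(byte[] payload, WriteListener listener) {
        listener.writeCompleted(new IOException("connection reset (simulated)"));
    }

    /** Sends a request and wires the write outcome back into the pending call. */
    static PendingCall send(byte[] request) {
        PendingCall call = new PendingCall();
        failingWrite(request, failure -> {
            if (failure != null) {
                call.handleError(failure); // without this line the caller only ever times out
            }
        });
        return call;
    }

    /** Returns the message of the error the caller observes after a failed send. */
    static String outcomeOfFailedSend() {
        try {
            return send(new byte[] {1, 2, 3}).get(2_000L);
        } catch (Exception e) {
            Throwable cause = e.getCause();
            return (cause != null ? cause : e).getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcomeOfFailedSend());
    }
}
```

The point of the sketch is only that the waiting side and the write side share one completion path, so a send that never left the client surfaces as an error instead of an indefinite wait.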
> It's difficult to provide a unit test for these issues because a 
> connection-refused error alone will not trigger them.  The only way I've been 
> able to reproduce them reliably is by setting the iptables rule mentioned above.  
> Hopefully a code review will be sufficient, but if necessary I can try to 
> find a way to create a unit test.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
