[
https://issues.apache.org/jira/browse/AVRO-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660028#comment-13660028
]
Doug Cutting commented on AVRO-1292:
------------------------------------
James, is this ready to commit? What kind of testing have you done? Is it
possible to add tests for this?
> NettyTransceiver: Client threads can block under certain connection failure scenarios
> -------------------------------------------------------------------------------------
>
> Key: AVRO-1292
> URL: https://issues.apache.org/jira/browse/AVRO-1292
> Project: Avro
> Issue Type: Bug
> Components: java
> Affects Versions: 1.7.4
> Reporter: James Baldassari
> Assignee: James Baldassari
> Labels: avro, ipc, netty
> Attachments: AVRO-1292-Part1.patch, AVRO-1292-Part2.patch,
> AVRO-1292-Part2-v2.patch
>
>
> I've recently found a couple of different failure scenarios with
> NettyTransceiver that result in:
> * Client threads blocking for long periods of time (uninterruptibly at that)
> while holding the {{stateLock}} write lock
> * RPCs (either sync or async) never returning because a failure in sending
> the RPC was not propagated back up to the caller
> The patch I'm going to submit will probably be a lot easier to understand,
> but I'll try to explain the main problems I found. There is a single type of
> underlying connectivity issue that seems to trigger both of these problems in
> NettyTransceiver: a failure at the network layer causes all packets to be
> dropped somewhere between the RPC client and server. You might think this
> would be a rare scenario, but it has happened several times in our production
> environment and usually occurs after the RPC server machine becomes
> unresponsive and needs to be physically rebooted. The only way I've been
> able to reproduce this scenario for testing purposes has been to set up an
> iptables rule on the RPC server that simply drops all incoming packets from
> the client. For example, if the client's IP is 10.0.0.1 I would use the
> following iptables rule on the server to reproduce the failure:
> {code}
> iptables -t mangle -A INPUT --source 10.0.0.1 -j DROP
> {code}
> After looking through a lot of stack traces I think I've identified 2 main
> problems:
> *Problem 1:* NettyTransceiver calls
> {{ChannelFuture#awaitUninterruptibly(long)}} in a couple of places:
> {{getChannel()}} and {{disconnect(boolean,boolean,Throwable)}}. Under the
> dropped packet scenario I outlined above, the client thread ends up blocking
> uninterruptibly for the entire connection timeout duration while holding the
> {{stateLock}} write lock. The stack trace for this situation looks like this:
> {code}
> "RPC Executor - 11 - 1363627762930" daemon prio=10 tid=0x00002aaad005f000 nid=0x56cf in Object.wait() [0x0000000049344000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:443)
> at org.jboss.netty.channel.DefaultChannelFuture.await0(DefaultChannelFuture.java:265)
> - locked <0x0000000703acfa00> (a org.jboss.netty.channel.DefaultChannelFuture)
> at org.jboss.netty.channel.DefaultChannelFuture.awaitUninterruptibly(DefaultChannelFuture.java:237)
> at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:248)
> at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:199)
> at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:148)
> {code}
> At a minimum it should be possible to interrupt these connection attempts.
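To illustrate what "interruptible" means for Problem 1, here is a minimal, JDK-only sketch. It is not the attached patch: the class and method names are hypothetical, and a `CountDownLatch` that is never released stands in for a connect future whose packets are being dropped. The point is that a timed wait that throws `InterruptedException`, performed before taking the `stateLock` write lock, lets another thread cancel the attempt. Netty 3's `ChannelFuture` also declares an interruptible `await(long)`, which is the kind of call this sketch models.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; names are illustrative, not taken from NettyTransceiver.
class InterruptibleConnectSketch {
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();

    // Stand-in for the connect future. It is never counted down here,
    // which models a connect attempt whose packets are all dropped.
    private final CountDownLatch connected = new CountDownLatch(1);

    /** Waits for the connection interruptibly and without holding stateLock. */
    boolean awaitConnect(long timeoutMillis) throws InterruptedException {
        // The timed await throws InterruptedException if the thread is interrupted.
        boolean ok = connected.await(timeoutMillis, TimeUnit.MILLISECONDS);
        if (ok) {
            // Take the write lock only briefly, after the wait has finished.
            stateLock.writeLock().lock();
            try {
                // A real transceiver would publish the connected channel here.
            } finally {
                stateLock.writeLock().unlock();
            }
        }
        return ok;
    }

    /** Returns true if interrupting a blocked client thread releases it promptly. */
    static boolean interruptReleasesWaiter() {
        InterruptibleConnectSketch transceiver = new InterruptibleConnectSketch();
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        Thread client = new Thread(() -> {
            try {
                transceiver.awaitConnect(60_000); // would otherwise block for a minute
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
            }
        });
        client.start();
        client.interrupt();
        try {
            client.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return sawInterrupt.get() && !client.isAlive();
    }
}
```

Calling `InterruptibleConnectSketch.interruptReleasesWaiter()` exercises the scenario: the client thread leaves the wait as soon as it is interrupted instead of sitting out the full timeout.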
> *Problem 2:* When an error occurs writing to the Netty channel the error is
> not passed back up the stack or callback chain (whether it's a sync or async
> RPC), so the client can end up waiting indefinitely for an RPC that will
> never return because an error occurred sending the Netty packet (i.e. it was
> never sent to the server in the first place). This scenario might yield a
> stack trace like the following:
> {code}
> "main" prio=10 tid=0x00007f9400008800 nid=0x379b waiting on condition [0x00007f9406bc6000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000007af677960> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
> at org.apache.avro.ipc.CallFuture.await(CallFuture.java:141)
> at org.apache.avro.ipc.Requestor.request(Requestor.java:150)
> at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
> at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
> at $Proxy9.send(Unknown Source)
> {code}
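The failure mode in Problem 2 can be sketched with plain JDK futures. This is an illustration under stated assumptions, not the attached patch: the class and method names are hypothetical, a `CompletableFuture<Void>` stands in for the channel's write future, and a `CompletableFuture<String>` stands in for the pending RPC response. Attaching a completion listener to the write and failing the pending response when the write fails is one way to make sure the caller is released with an error instead of waiting forever.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names are illustrative, not taken from Avro IPC.
class WriteFailurePropagationSketch {

    /**
     * Models sending an RPC. writeResult stands in for the channel's write
     * future; the returned future stands in for the pending RPC response.
     */
    static CompletableFuture<String> send(CompletableFuture<Void> writeResult) {
        CompletableFuture<String> response = new CompletableFuture<>();
        // Listener on the write: if the write fails, fail the pending
        // response too, so that a caller blocked on it wakes up.
        writeResult.whenComplete((ignored, error) -> {
            if (error != null) {
                response.completeExceptionally(error);
            }
        });
        return response;
    }

    /** Simulates a failed write and reports what the waiting caller observes. */
    static String describeOutcome() {
        CompletableFuture<Void> write = new CompletableFuture<>();
        CompletableFuture<String> response = send(write);

        // The request never leaves the client.
        write.completeExceptionally(new IOException("write failed"));

        try {
            return "response: " + response.get(1, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getMessage();
        } catch (TimeoutException e) {
            return "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }
}
```

Without the listener in `send`, the `get` call in `describeOutcome()` would run into its timeout, which mirrors the parked `CountDownLatch` wait in the stack trace above; with it, the caller sees the write error.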
> It's difficult to provide a unit test for these issues because a connection
> refused error alone will not trigger them. The only way I've been able to
> reliably reproduce them is by setting the iptables rule I mentioned above.
> Hopefully a code review will be sufficient, but if necessary I can try to
> find a way to create a unit test.