[ https://issues.apache.org/jira/browse/AVRO-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gareth Davis updated AVRO-1407:
-------------------------------
    Attachment: AVRO-1407-testcase.patch

Test case that demonstrates the issue.

Applying it shows the problem; it shows up even more clearly if you comment out
the serverSocket.close() call, since the failing test then never terminates.

Applying the actual code patch fixes the test.
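
The attachment itself is not reproduced here, but the scenario it exercises is
roughly the following (a hypothetical sketch, not the attached patch; as the
issue below notes, it only misbehaves when the connect attempt genuinely
outlives the connect timeout):

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.util.concurrent.Executors;
import org.apache.avro.ipc.NettyTransceiver;
import org.jboss.netty.channel.ChannelFactory;
import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory;
import org.junit.Test;

public class SlowConnectSketch {
  @Test
  public void releaseResourcesAfterTimedOutConnect() throws Exception {
    ServerSocket serverSocket = new ServerSocket(0);   // bound, but never calls accept()
    ChannelFactory factory = new NioClientSocketChannelFactory(
        Executors.newCachedThreadPool(), Executors.newCachedThreadPool());
    try {
      // Tiny timeout so the constructor gives up waiting before the connect completes.
      new NettyTransceiver(
          new InetSocketAddress("127.0.0.1", serverSocket.getLocalPort()), factory, 1L);
    } catch (IOException expected) {
      // expected: the constructor stops waiting on the connect future and throws
    } finally {
      serverSocket.close();                 // comment this out and, unfixed, the test never ends
      factory.releaseExternalResources();   // hangs while the leaked channel is still active
    }
  }
}
{code}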


> NettyTransceiver can cause an infinite loop when slow to connect
> ----------------------------------------------------------------
>
>                 Key: AVRO-1407
>                 URL: https://issues.apache.org/jira/browse/AVRO-1407
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.5, 1.7.6
>            Reporter: Gareth Davis
>            Assignee: Gareth Davis
>             Fix For: 1.7.8
>
>         Attachments: AVRO-1407-1.patch, AVRO-1407-testcase.patch
>
>
> When a new {{NettyTransceiver}} is created it forces the channel to be 
> allocated and connected to the remote host, waiting up to connectTimeout ms 
> on the [connect channel 
> future|https://github.com/apache/avro/blob/1579ab1ac95731630af58fc303a07c9bf28541d6/lang/java/ipc/src/main/java/org/apache/avro/ipc/NettyTransceiver.java#L271].
>  That is obviously a good thing; the trouble is that when the connect is 
> unsuccessful, i.e. {{!channelFuture.isSuccess()}}, an exception is thrown and 
> the constructor fails with an {{IOException}}, but this has the potential to 
> leave an active channel associated with the {{ChannelFactory}}.
> The problem is that a Netty {{NioClientSocketChannelFactory}} will not 
> shut down while there are still active channels around, so if you have 
> supplied the {{ChannelFactory}} to the {{NettyTransceiver}} yourself you will 
> not be able to shut it down by calling 
> {{ChannelFactory.releaseExternalResources()}}, as the [Flume Avro RPC client 
> does|https://github.com/apache/flume/blob/b8cf789b8509b1e5be05dd0b0b16c5d9af9698ae/flume-ng-sdk/src/main/java/org/apache/flume/api/NettyAvroRpcClient.java#L158].
>  To recreate this you need a very laggy network, where the connect attempt 
> takes longer than the connect timeout but does eventually succeed. This is 
> very hard to arrange in a test case, although I do have a test setup using 
> Vagrant VMs that recreates it every time, using the Flume RPC client and 
> server.
> The following stack is from a production system; it will never recover until 
> the channel is disconnected (by forcing a disconnect at the remote host) or 
> the JVM is restarted.
> {noformat:title=Production stack trace}
> "TLOG-0" daemon prio=10 tid=0x00007f581c7be800 nid=0x39a1 waiting on condition [0x00007f57ef9f2000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00000007218b16e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
>     at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1253)
>     at org.jboss.netty.util.internal.ExecutorUtil.terminate(ExecutorUtil.java:103)
>     at org.jboss.netty.channel.socket.nio.AbstractNioWorkerPool.releaseExternalResources(AbstractNioWorkerPool.java:80)
>     at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.releaseExternalResources(NioClientSocketChannelFactory.java:181)
>     at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:142)
>     at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:101)
>     at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:564)
>     - locked <0x00000006c30ae7b0> (a org.apache.flume.api.NettyAvroRpcClient)
>     at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
>     at org.apache.flume.api.LoadBalancingRpcClient.createClient(LoadBalancingRpcClient.java:214)
>     at org.apache.flume.api.LoadBalancingRpcClient.getClient(LoadBalancingRpcClient.java:205)
>     - locked <0x00000006a97b18e8> (a org.apache.flume.api.LoadBalancingRpcClient)
>     at org.apache.flume.api.LoadBalancingRpcClient.appendBatch(LoadBalancingRpcClient.java:95)
>     at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:45)
>     at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:43)
> {noformat}
> The solution is very simple, and a patch should be along in a moment.



