[ 
https://issues.apache.org/jira/browse/FLINK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230430#comment-17230430
 ] 

Arvid Heise commented on FLINK-19925:
-------------------------------------

For an unstable connection, there is not much that we can do except rely on 
the normal recovery logic (which may cause failures in tests that limit 
recovery for good reasons). However, the root cause of the failure is not 
exactly clear to me.
{code:java}
try {
   Channel channel = nettyClient.connect(connectionId.getAddress()).await().channel();
   NetworkClientHandler clientHandler = channel.pipeline().get(NetworkClientHandler.class); // <-- this is null
   return new NettyPartitionRequestClient(channel, clientHandler, connectionId, this); // <-- null check in ctor fails here
} catch (InterruptedException e) {
   throw e;
} catch (Exception e) {
   throw new RemoteTransportException(
      "Connecting to remote task manager '" + connectionId.getAddress() +
         "' has failed. This might indicate that the remote task " +
         "manager has been lost.",
      connectionId.getAddress(), e);
}
{code}
So I'd expect {{nettyClient.connect}} to either fail or succeed. If it fails, 
we get the message above, and I think that is as good as it gets. But if it 
succeeds (as in Robert's stack trace), then I'd expect the {{clientHandler}} 
to be non-null. So it seems fishy, but I have not looked into Netty deeply 
enough to find out what's wrong. [~kevin.cyj] could you check whether you can 
find a good reason for the null value?

A fix for this issue could also just be a better error message or some comment 
in the class to indicate why it's null.
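
As a rough sketch of the "better error message" option: the lookup result could be checked right where it is obtained, so that the failure names the connection instead of surfacing as a bare null check in the constructor. The types below ({{FakePipeline}}, {{ClientHandler}}, {{requireClientHandler}}) are hypothetical stand-ins so that the example runs without Netty or Flink on the classpath; they are not the real classes, and the message text is only a suggestion.

```java
import java.util.HashMap;
import java.util.Map;

public class HandlerLookupSketch {

    /** Hypothetical stand-in for NetworkClientHandler. */
    interface ClientHandler {}

    /** Hypothetical stand-in for a channel pipeline: looks up handlers by type. */
    static final class FakePipeline {
        private final Map<Class<?>, Object> handlers = new HashMap<>();

        <T> void add(Class<T> type, T handler) {
            handlers.put(type, handler);
        }

        /** Returns the handler registered for the type, or null if there is none. */
        <T> T get(Class<T> type) {
            return type.cast(handlers.get(type));
        }
    }

    /**
     * Looks up the client handler and fails with a message that names the
     * remote address, instead of passing a null handler on to the caller.
     */
    static ClientHandler requireClientHandler(FakePipeline pipeline, String remoteAddress) {
        ClientHandler handler = pipeline.get(ClientHandler.class);
        if (handler == null) {
            throw new IllegalStateException(
                "Expected a client handler in the channel pipeline for the connection to '"
                    + remoteAddress + "', but found none.");
        }
        return handler;
    }

    public static void main(String[] args) {
        // Pipeline without a handler: the lookup fails with a descriptive message.
        FakePipeline empty = new FakePipeline();
        try {
            requireClientHandler(empty, "taskmanager-1:6121");
        } catch (IllegalStateException e) {
            System.out.println("Lookup failed: " + e.getMessage());
        }

        // Pipeline with a handler: the lookup returns it.
        FakePipeline initialized = new FakePipeline();
        initialized.add(ClientHandler.class, new ClientHandler() {});
        ClientHandler found = requireClientHandler(initialized, "taskmanager-1:6121");
        System.out.println("Lookup succeeded: " + (found != null));
    }
}
```

This would only make the symptom easier to read; it does not explain why the handler is missing in the first place.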

In any case, I'd suggest lowering the priority.

> Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19925
>                 URL: https://issues.apache.org/jira/browse/FLINK-19925
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.12.0
>            Reporter: godfrey he
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.12.0
>
>
> Errors$NativeIoException sometimes occurs when we run TPC-DS on a build 
> based on master. The full exception stack is:
> {code:java}
> Caused by: org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection reset by peer (connection to 'xxx')
>       at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:173) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
>       at java.lang.Thread.run(Thread.java:834) ~[?:1.8.0_102]
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
