[
https://issues.apache.org/jira/browse/FLINK-19925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229211#comment-17229211
]
Robert Metzger commented on FLINK-19925:
----------------------------------------
I also noticed this exception in a CI run of the {{Local recovery and sticky
scheduling end-to-end test}}. Due to another bug (FLINK-19882), it was not
reported as a test failure:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=9389&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529
{code}
2020-11-10T12:19:33.7135577Z Nov 10 12:19:33 2020-11-10 12:19:26,076 WARN
akka.remote.ReliableDeliverySupervisor [] - Association
with remote system [akka.tcp://[email protected]:42025] has failed, address is now
gated for [50] ms. Reason: [Disassociated]
2020-11-10T12:19:33.7137278Z Nov 10 12:19:33 2020-11-10 12:19:26,078 WARN
akka.remote.ReliableDeliverySupervisor [] - Association
with remote system [akka.tcp://[email protected]:39283] has failed,
address is now gated for [50] ms. Reason: [Disassociated]
2020-11-10T12:19:33.7138935Z Nov 10 12:19:33 2020-11-10 12:19:26,305 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat Map ->
Sink: Unnamed (2/4) (3bd0efab238877e710563ad4f49f87a0) switched from RUNNING to
FAILED on
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@7d7cbe96.
2020-11-10T12:19:33.7140369Z Nov 10 12:19:33
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
readAddress(..) failed: Connection reset by peer (connection to
'10.1.0.4/10.1.0.4:43007')
2020-11-10T12:19:33.7141721Z Nov 10 12:19:33 at
org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:173)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7143118Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7144513Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7145905Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7147250Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7148622Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7149987Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7151331Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7152923Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7154494Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7155806Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7157051Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7158290Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7160165Z Nov 10 12:19:33 at
org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
2020-11-10T12:19:33.7160852Z Nov 10 12:19:33 at
java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_272]
2020-11-10T12:19:33.7161480Z Nov 10 12:19:33 Caused by:
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection reset by peer
2020-11-10T12:19:33.7162958Z Nov 10 12:19:33 2020-11-10 12:19:26,366 DEBUG
org.apache.flink.runtime.scheduler.SharedSlot [] - Remove
logical slot (SlotRequestId{8e34fc6893cd64b17fb0125192c96e18}) for execution
vertex (id 20ba6b65f97481d5570070de90e4e791_1) from the physical slot
(SlotRequestId{ac67d850d673c4ceb476a5fb5cd40cba})
{code}
> Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
> --------------------------------------------------------------------------
>
> Key: FLINK-19925
> URL: https://issues.apache.org/jira/browse/FLINK-19925
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.12.0
> Reporter: godfrey he
> Priority: Major
> Fix For: 1.12.0
>
>
> Errors$NativeIoException occurs intermittently when we run TPC-DS on
> master. The full exception stack trace is:
> {code:java}
> Caused by:
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
> readAddress(..) failed: Connection reset by peer (connection to 'xxx')
> at
> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:173)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at java.lang.Thread.run(Thread.java:834) ~[?:1.8.0_102]
> Caused by:
> org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
> readAddress(..) failed: Connection reset by peer
> {code}