[ 
https://issues.apache.org/jira/browse/FLINK-39103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39103:
-----------------------------------
    Labels: pull-request-available  (was: )

> BufferManager.recycle fails TM due to RemoteTransportException
> --------------------------------------------------------------
>
>                 Key: FLINK-39103
>                 URL: https://issues.apache.org/jira/browse/FLINK-39103
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 2.3.0
>            Reporter: Roman Khachatryan
>            Assignee: Roman Khachatryan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.3.0
>
>
> I was looking into 
> [this|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=71888&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=9d734c8c-6253-55e6-3bce-47e7cdf68ac4&l=40624]
>  CI failure of EventTimeWindowCheckpointingITCase testSlidingTimeWindow for 
> my PR.
>  
> It looks like the artificial failure on one TM caused a fatal failure in 
> another TM, which eventually failed the test because not enough resources were available:
> {code:java}
> 11:57:54,793 [SlidingEventTimeWindows (4/4)#0] ERROR 
> org.apache.flink.runtime.minicluster.MiniCluster             [] - TaskManager 
> #0 failed.
> java.lang.RuntimeException: 
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
> Error at remote task manager 'localhost/127.0.0.1:40903 [ 
> 8b99efa6-f87f-4fba-9d9d-9b8497d249fa ] '.
>       at 
> org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321) 
> ~[flink-core-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.partition.consumer.BufferManager.recycle(BufferManager.java:237)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:189)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.handleRelease(AbstractReferenceCountedByteBuf.java:151)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:141)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:161)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.clear(SpillingAdaptiveSpanningRecordDeserializer.java:140)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.releaseDeserializer(AbstractStreamTaskNetworkInput.java:328)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.close(AbstractStreamTaskNetworkInput.java:320)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.close(StreamTaskNetworkInput.java:142)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.close(StreamOneInputProcessor.java:88)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInternal(StreamTask.java:1112)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:257) 
> ~[flink-core-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:83)
>  ~[flink-core-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
>  ~[flink-core-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:1103)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$2(Task.java:972)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:987)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$3(Task.java:972)
>  [flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:257) 
> ~[flink-core-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:83)
>  ~[flink-core-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
>  ~[flink-core-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:808) 
> [flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:579) 
> [flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at java.base/java.lang.Thread.run(Thread.java:833) [?:?]
> Caused by: 
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
> Error at remote task manager 'localhost/127.0.0.1:40903 [ 
> 8b99efa6-f87f-4fba-9d9d-9b8497d249fa ] '.
>       at 
> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.decodeMsg(CreditBasedPartitionRequestClientHandler.java:333)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelRead(CreditBasedPartitionRequestClientHandler.java:197)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:356)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelRead(NettyMessageClientDecoderDelegate.java:112)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:356)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1429)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:794)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.handle(AbstractEpollChannel.java:482)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollIoHandler$DefaultEpollIoRegistration.handle(EpollIoHandler.java:317)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollIoHandler.processReady(EpollIoHandler.java:514)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollIoHandler.run(EpollIoHandler.java:459)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.SingleThreadIoEventLoop.runIo(SingleThreadIoEventLoop.java:225)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.SingleThreadIoEventLoop.run(SingleThreadIoEventLoop.java:196)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:1193)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       ... 1 more
> Caused by: 
> org.apache.flink.runtime.io.network.partition.ProducerFailedException: 
> java.lang.Exception: java.lang.Exception: Artificial Failure
>       at 
> org.apache.flink.runtime.io.network.partition.PipelinedSubpartitionView.getFailureCause(PipelinedSubpartitionView.java:96)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.CreditBasedSequenceNumberingViewReader.getFailureCause(CreditBasedSequenceNumberingViewReader.java:282)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.writeAndFlushNextMessageIfPossible(PartitionRequestQueue.java:325)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.enqueueAvailableReader(PartitionRequestQueue.java:126)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.addCreditOrResumeConsumption(PartitionRequestQueue.java:175)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:115)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:42)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:356)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:356)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1429)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:794)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.handle(AbstractEpollChannel.java:482)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollIoHandler$DefaultEpollIoRegistration.handle(EpollIoHandler.java:317)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollIoHandler.processReady(EpollIoHandler.java:514)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollIoHandler.run(EpollIoHandler.java:459)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.SingleThreadIoEventLoop.runIo(SingleThreadIoEventLoop.java:225)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.channel.SingleThreadIoEventLoop.run(SingleThreadIoEventLoop.java:196)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:1193)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       at 
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>  ~[flink-shaded-netty-4.2.6.Final-21.0.jar:?]
>       ... 1 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.lang.Exception: 
> Artificial Failure
>       at 
> org.apache.flink.test.checkpointing.utils.FailingSource.run(FailingSource.java:111)
>  ~[test-classes/:?]
>       at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:107)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:68)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]
>       at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:346)
>  ~[flink-runtime-2.3-SNAPSHOT.jar:2.3-SNAPSHOT]{code}
> The issue is that BufferManager.recycle rethrows an exception from the input 
> channel (and it is not handled).
> I believe this is the cause of many CI instabilities.
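
To make the failure mode described above concrete, here is a minimal, self-contained sketch. It is not the Flink code and not the fix from the linked pull request: `Channel`, `Manager`, `recycleRethrowing`, `recycleTolerant`, and `escapes` are hypothetical names invented for illustration. It assumes only what the report states, namely that a recycle call made during task cleanup rethrows an error recorded on the input channel, and contrasts that with one possible alternative in which the buffer is always returned and the error is reported without throwing.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

public class RecycleSketch {

    /** Hypothetical stand-in for an input channel that may have recorded a remote error. */
    static final class Channel {
        private final IOException error;

        Channel(IOException error) {
            this.error = error;
        }

        void checkError() throws IOException {
            if (error != null) {
                throw error;
            }
        }
    }

    /** Hypothetical stand-in for a component that returns buffers to a pool. */
    static final class Manager {
        final Deque<byte[]> pool = new ArrayDeque<>();
        private final Channel channel;

        Manager(Channel channel) {
            this.channel = channel;
        }

        /** Pattern from the report: the channel error escapes as an unchecked exception. */
        void recycleRethrowing(byte[] buffer) {
            try {
                channel.checkError();
                pool.add(buffer);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        /** Assumed alternative: always return the buffer, signal the error via the result. */
        boolean recycleTolerant(byte[] buffer) {
            pool.add(buffer);
            try {
                channel.checkError();
                return true;
            } catch (IOException e) {
                return false;
            }
        }
    }

    /** Simulates a cleanup step; returns true if an unchecked exception escaped from it. */
    static boolean escapes(Runnable cleanupStep) {
        try {
            cleanupStep.run();
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Manager manager = new Manager(new Channel(new IOException("simulated remote failure")));
        System.out.println("rethrowing escapes: " + escapes(() -> manager.recycleRethrowing(new byte[1])));
        System.out.println("tolerant escapes:   " + escapes(() -> manager.recycleTolerant(new byte[1])));
        System.out.println("buffers in pool:    " + manager.pool.size());
    }
}
```

In the sketch, the rethrowing variant lets the channel error propagate out of a cleanup step that has no handler for it, which corresponds to the top of the stack trace above. Whether the real fix swallows, logs, or redirects the exception is not stated in this report.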



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
