[
https://issues.apache.org/jira/browse/FLINK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692890#comment-16692890
]
Qi commented on FLINK-10941:
----------------------------
[~till.rohrmann] mentioned that increasing the slot idle timeout:
`slot.idle.timeout: 1800000` can serve as a temporary workaround.
[~zjwang] per your previous email [1], an external shuffle service may help
resolve this issue. Could you share more updates on that? Or do you happen to
know any other solution to this issue? Thank you!
[1]
http://mail-archives.apache.org/mod_mbox/flink-dev/201808.mbox/%3c7605d978-d22b-4f5d-a831-89d7d59c5fdd.wangzhijiang...@aliyun.com%3E
> Slots prematurely released which still contain unconsumed data
> ---------------------------------------------------------------
>
> Key: FLINK-10941
> URL: https://issues.apache.org/jira/browse/FLINK-10941
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager
> Affects Versions: 1.5.5
> Reporter: Qi
> Priority: Major
>
> Our case is: Flink 1.5 batch mode, 32 parallelism to read data source and 4
> parallelism to write data sink.
>
> The read task worked perfectly with 32 TMs. However when the job was
> executing the write task, since only 4 TMs were needed, other 28 TMs were
> released. This caused RemoteTransportException in the write task:
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager
> ’the_previous_TM_used_by_read_task'. This might indicate that the remote task
> manager was lost.
> at
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:133)
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
> ...
>
> After skimming YarnFlinkResourceManager related code, it seems to me that
> Flink is releasing TMs when they’re idle, regardless of whether working TMs
> need them.
>
> Put in another way, Flink seems to prematurely release slots which contain
> unconsumed data and, thus, eventually release a TM which then fails a
> consuming task.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)