[
https://issues.apache.org/jira/browse/FLINK-28695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635947#comment-17635947
]
Piotr Nowojski commented on FLINK-28695:
----------------------------------------
I've assigned the ticket to you [~fanrui]. Please ping me/request a review from
me once you have some fix ready.
In the meantime, can someone who can reproduce this issue try out whether
setting {{taskmanager.network.max-num-tcp-connections}} to a very high number
works around the problem?
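For reference, a minimal sketch of that workaround in {{flink-conf.yaml}} (the value 64 below is only an illustrative placeholder for "a very high number", not a tested recommendation):
{code:yaml}
# flink-conf.yaml — hypothetical workaround sketch.
# Raises the cap on TCP connections between each pair of TaskManagers;
# the default is 1, so all partition requests share a single channel.
taskmanager.network.max-num-tcp-connections: 64
{code}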
> Fail to send partition request to restarted taskmanager
> -------------------------------------------------------
>
> Key: FLINK-28695
> URL: https://issues.apache.org/jira/browse/FLINK-28695
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Runtime / Network
> Affects Versions: 1.15.0, 1.15.1
> Reporter: Simonas
> Assignee: Rui Fan
> Priority: Critical
> Attachments: deployment.txt, job_log.txt, jobmanager_config.txt,
> jobmanager_logs.txt, pod_restart.txt, taskmanager_config.txt
>
>
> After upgrading to *1.15.1* we started getting errors while running the JOB
>
> {code:java}
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
> Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed. at
> org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
> .... {code}
> {code:java}
> Caused by:
> org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
>
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object,
> ChannelPromise)(Unknown Source){code}
> After investigation we managed to narrow it down to the exact behavior when
> this issue happens:
> # Deploying the JOB on a fresh Kubernetes session cluster with multiple
> TaskManagers TM1 and TM2 is successful. The JOB has multiple partitions
> running on both TM1 and TM2.
> # One TaskManager, TM2 (XXX.XXX.XX.32), fails due to an unrelated issue, for
> example an OOM exception.
> # The Kubernetes POD with the mentioned TaskManager TM2 is restarted. The POD
> retains the same IP address as before.
> # The JobManager is able to pick up the restarted TM2 (XXX.XXX.XX.32).
> # The JOB is restarted because it was running on the failed TaskManager TM2.
> # The TM1 data channel to TM2 is closed and we get LocalTransportException:
> Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed during the
> JOB running stage.
> # When we explicitly delete the POD with TM2, a new POD with a different IP
> address is created and the JOB is able to start again.
> Important to note that we didn't encounter this issue with the previous
> *1.14.4* version, where TaskManager restarts didn't cause such errors.
> Please see the attached Kubernetes deployments and reduced logs from the
> JobManager. The TaskManager logs did show errors before the restart, but
> don't show anything significant after it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)