[
https://issues.apache.org/jira/browse/FLINK-28695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Piotr Nowojski closed FLINK-28695.
----------------------------------
Fix Version/s: 1.17.0
1.16.1
1.15.4
Resolution: Fixed
merged commit 6b56505 into apache:release-1.15
merged commit ecede7a into apache:release-1.16
merged e7854193816^^^..e7854193816 into apache:master
Thanks [~fanrui] for the fix, and everyone else for reporting, analyzing, and
confirming the bug and the workaround.
> Fail to send partition request to restarted taskmanager
> -------------------------------------------------------
>
> Key: FLINK-28695
> URL: https://issues.apache.org/jira/browse/FLINK-28695
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Runtime / Network
> Affects Versions: 1.15.0, 1.15.1
> Reporter: Simonas
> Assignee: Rui Fan
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.17.0, 1.16.1, 1.15.4
>
> Attachments: deployment.txt, image-2022-11-20-16-16-45-705.png,
> image-2022-11-21-17-15-58-749.png, job_log.txt, jobmanager_config.txt,
> jobmanager_logs.txt, pod_restart.txt, taskmanager_config.txt
>
>
> After upgrading to *1.15.1* we started getting the following error while
> running a JOB:
> {code:java}
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed.
>     at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
>     ...
> {code}
> {code:java}
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
> {code}
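> For context, the exception above is raised from a listener attached to a
> Netty write: when the channel to TM2 is already closed, the write future
> fails and the listener wraps the cause. A minimal standalone sketch of that
> pattern (hypothetical class and method names, not Flink's actual code):
> {code:java}
> import org.apache.flink.shaded.netty4.io.netty.channel.Channel;
> import org.apache.flink.shaded.netty4.io.netty.channel.ChannelFutureListener;
>
> public class WriteListenerSketch {
>
>     /** Sends a request and surfaces write failures, mirroring the
>      *  operationComplete(...) frame in the stack trace above. */
>     static void send(Channel channel, Object request) {
>         channel.writeAndFlush(request)
>                 .addListener((ChannelFutureListener) future -> {
>                     if (!future.isSuccess()) {
>                         // When the channel was already closed by the peer,
>                         // future.cause() is a (Stackless)ClosedChannelException,
>                         // matching the "Caused by" in the report.
>                         throw new RuntimeException(
>                                 "Sending the partition request to '"
>                                         + channel.remoteAddress() + "' failed.",
>                                 future.cause());
>                     }
>                 });
>     }
> }
> {code}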
> After investigation we managed to narrow it down to the exact sequence of
> events under which this issue happens:
> # Deploying the JOB on a fresh Kubernetes session cluster with multiple
> TaskManagers (TM1 and TM2) is successful. The JOB has multiple partitions
> running on both TM1 and TM2.
> # One TaskManager, TM2 (XXX.XXX.XX.32), fails for an unrelated reason, for
> example an OOM exception.
> # The Kubernetes POD with the mentioned TaskManager TM2 is restarted. The POD
> retains the same IP address as before.
> # The JobManager is able to pick up the restarted TM2 (XXX.XXX.XX.32).
> # The JOB is restarted because it was running on the failed TaskManager TM2.
> # TM1's data channel to TM2 is closed, and during the JOB running stage we
> get LocalTransportException: Sending the partition request to
> '/XXX.XXX.XX.32:6121 (#0)' failed (see the sketch after this list).
> # When we explicitly delete the POD with TM2, a new POD is created with a
> different IP address and the JOB is able to start again.
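> The behavior above is consistent with a client connection being cached per
> remote address and reused without checking whether it is still open: since
> the restarted TM2 keeps the same IP and port, the stale cached entry keeps
> being returned. A minimal hypothetical sketch of the buggy reuse and of a
> validity-check fix (plain Java, not Flink's actual
> PartitionRequestClientFactory):
> {code:java}
> import java.net.InetSocketAddress;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> public class ConnectionCacheSketch {
>
>     static class PooledConnection {
>         volatile boolean closed; // flipped to true when the peer goes away
>     }
>
>     private final Map<InetSocketAddress, PooledConnection> cache =
>             new ConcurrentHashMap<>();
>
>     // Buggy pattern: blindly reuse whatever is cached for this address.
>     // After TM2 restarts on the same IP:port, the cached entry stays
>     // closed and every partition request sent over it fails.
>     PooledConnection get(InetSocketAddress address) {
>         return cache.computeIfAbsent(address, a -> new PooledConnection());
>     }
>
>     // Fixed pattern: drop a closed entry and reconnect, so a peer that
>     // restarted on the same address gets a fresh connection.
>     PooledConnection getChecked(InetSocketAddress address) {
>         return cache.compute(address, (a, conn) ->
>                 (conn == null || conn.closed) ? new PooledConnection() : conn);
>     }
> }
> {code}
> This also explains why deleting the POD works around the problem: the new
> POD gets a different IP address, which misses the stale cache entry.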
> It is important to note that we didn't encounter this issue with the
> previous *1.14.4* version, where TaskManager restarts didn't cause such an
> error.
> Please see the attached Kubernetes deployments and the reduced logs from the
> JobManager. The TaskManager logs show errors before the failure, but nothing
> significant after the restart.