Dong Lin created FLINK-31681:
--------------------------------
Summary: Network connection timeout between operators should
trigger either network re-connection or job failover
Key: FLINK-31681
URL: https://issues.apache.org/jira/browse/FLINK-31681
Project: Flink
Issue Type: Bug
Reporter: Dong Lin
If a network connection error occurs between two operators, the upstream
operator may log the error message shown below in the method
PartitionRequestQueue#handleException and then close the connection.
When this happens, the Flink job may hang indefinitely, neither completing
nor failing.
To avoid this issue, we can either allow the upstream operator to reconnect
to the downstream operator, or trigger job failover so that users can take
corrective action promptly.
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) failed: Connection timed out.
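
The two remedies proposed above could be combined into one policy: try to
reconnect a bounded number of times, and trigger job failover if every
attempt fails, so the job never hangs silently. The following is only a
minimal sketch of that idea. ConnectionErrorPolicy, Outcome,
onConnectionError, tryReconnect and triggerFailover are hypothetical names
chosen for illustration; none of them are existing Flink or Netty APIs, and
the sketch does not show how PartitionRequestQueue#handleException would
actually be changed.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; not part of Flink. */
final class ConnectionErrorPolicy {

    /** Result of handling a single connection error. */
    enum Outcome { RECONNECTED, FAILOVER }

    private final int maxReconnectAttempts;

    ConnectionErrorPolicy(int maxReconnectAttempts) {
        if (maxReconnectAttempts < 0) {
            throw new IllegalArgumentException("maxReconnectAttempts must be >= 0");
        }
        this.maxReconnectAttempts = maxReconnectAttempts;
    }

    /**
     * Calls tryReconnect up to maxReconnectAttempts times and stops at the
     * first success. If no attempt succeeds, runs triggerFailover so the
     * error surfaces as a job failure instead of a stuck job.
     *
     * Both callbacks are assumptions: tryReconnect stands in for whatever
     * re-establishes the channel, triggerFailover for whatever fails the task.
     */
    Outcome onConnectionError(BooleanSupplier tryReconnect, Runnable triggerFailover) {
        for (int attempt = 1; attempt <= maxReconnectAttempts; attempt++) {
            if (tryReconnect.getAsBoolean()) {
                return Outcome.RECONNECTED;
            }
        }
        triggerFailover.run();
        return Outcome.FAILOVER;
    }
}
```

With maxReconnectAttempts set to 0 the policy reduces to the second option
alone: every connection error immediately triggers failover.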
--
This message was sent by Atlassian Jira
(v8.20.10#820010)