[ https://issues.apache.org/jira/browse/FLINK-31681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Piotr Nowojski updated FLINK-31681:
-----------------------------------
Affects Version/s: 1.15.1
> Network connection timeout between operators should trigger either network
> re-connection or job failover
> --------------------------------------------------------------------------------------------------------
>
> Key: FLINK-31681
> URL: https://issues.apache.org/jira/browse/FLINK-31681
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.15.1
> Reporter: Dong Lin
> Priority: Major
>
> If a network connection error occurs between two operators, the upstream
> operator may log the error message shown below in the method
> PartitionRequestQueue#handleException and then close the connection.
> When this happens, the Flink job may become stuck, neither completing nor
> failing.
> To avoid this issue, we can either allow the upstream operator to reconnect
> to the downstream operator, or trigger job failover so that users can take
> corrective action promptly.
> org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered
> error while consuming partitions
> org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
> writeAddress(...) failed: Connection timed out.
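
The two remedies proposed in the description (reconnect, or fail the job over) can be combined into a single policy: retry the connection a bounded number of times, then escalate. The following is only a minimal sketch of that idea in plain Java, not Flink code. The names Reconnector, ConnectionErrorPolicy, onConnectionError and the failover callback are all hypothetical and do not correspond to any existing Flink or Netty API.

```java
import java.io.IOException;
import java.util.function.Consumer;

/** Hypothetical stand-in for whatever re-establishes the network connection. */
interface Reconnector {
    void reconnect() throws IOException;
}

/**
 * Hypothetical policy (not a Flink class): on a connection error, try to
 * reconnect a bounded number of times; if every attempt fails, hand the last
 * error to a failover callback instead of silently closing the connection.
 */
final class ConnectionErrorPolicy {
    private final int maxReconnectAttempts;
    private final Reconnector reconnector;
    private final Consumer<Throwable> failoverTrigger;

    ConnectionErrorPolicy(
            int maxReconnectAttempts,
            Reconnector reconnector,
            Consumer<Throwable> failoverTrigger) {
        this.maxReconnectAttempts = maxReconnectAttempts;
        this.reconnector = reconnector;
        this.failoverTrigger = failoverTrigger;
    }

    /** Returns true if a reconnect succeeded, false if failover was triggered. */
    boolean onConnectionError(Throwable cause) {
        Throwable lastError = cause;
        for (int attempt = 1; attempt <= maxReconnectAttempts; attempt++) {
            try {
                reconnector.reconnect();
                return true; // connection restored, the job can keep running
            } catch (IOException e) {
                lastError = e; // remember the most recent failure and retry
            }
        }
        // No attempt succeeded: surface the error so the job does not hang.
        failoverTrigger.accept(lastError);
        return false;
    }
}

final class ConnectionErrorPolicyDemo {
    public static void main(String[] args) {
        ConnectionErrorPolicy policy = new ConnectionErrorPolicy(
                3,
                () -> { throw new IOException("Connection timed out"); },
                cause -> System.out.println("Triggering failover: " + cause.getMessage()));

        boolean recovered = policy.onConnectionError(new IOException("Connection timed out"));
        System.out.println("recovered = " + recovered);
    }
}
```

In this sketch either outcome is visible to the caller, so the job would not stay stuck in the state described above. Where such a policy would actually hook into the network stack is left open here.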