[jira] [Comment Edited] (FLINK-31681) Network connection timeout between operators should trigger either network re-connection or job failover

Dong Lin (Jira) Fri, 31 Mar 2023 07:27:13 -0700


    [ https://issues.apache.org/jira/browse/FLINK-31681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707281#comment-17707281 ]


Dong Lin edited comment on FLINK-31681 at 3/31/23 2:25 PM:
-----------------------------------------------------------

This happened with Flink 1.15.1 while we were testing Flink ML with 
parallelism = 200.

Upgrading the internal Flink library and the related connectors needed by 
Flink ML would take some time, so we have not yet tried to reproduce this 
issue with Flink 1.17.

For now, I am recording the symptoms and the error message in this JIRA so 
that the issue is tracked. I will close this JIRA if we cannot reproduce the 
issue with the latest Flink version.


was (Author: lindong):
This happens with Flink version 1.15.1 when we were testing Flink ML with 
parallelism = 200.

Upgrading the internal Flink library and related connectors needed by Flink ML 
would take some time. Thus we have not tried to reproduce this issue with Flink 
1.17.

Thus I choose to write down the phenomenal and the error message in this JIRA 
to make sure this issue will be tracked. I will close this JIRA if we can not 
reproduce the issue with the latest Flink version.

> Network connection timeout between operators should trigger either network 
> re-connection or job failover
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-31681
>                 URL: https://issues.apache.org/jira/browse/FLINK-31681
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Dong Lin
>            Priority: Major
>
> If a network connection error occurs between two operators, the upstream 
> operator may log the following error message in the method 
> PartitionRequestQueue#handleException and then close the connection. 
> When this happens, the Flink job may hang without either completing or 
> failing. 
> To avoid this, we can either let the upstream operator reconnect to the 
> downstream operator, or trigger job failover so that users can take 
> corrective action promptly.
>
> org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered 
> error while consuming partitions 
> org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
>  writeAddress(..) failed: Connection timed out.
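
The two remedies proposed in the description (reconnect, or fail the job rather than hang) could be combined into a bounded-retry policy. The sketch below is only an illustration of that idea and is not Flink's network stack: `ReconnectOrFail`, `sendWithReconnect`, `JobFailoverException`, and the `maxReconnects` parameter are all hypothetical names, and a plain `IOException` stands in for the Netty connection timeout.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class ReconnectOrFail {

    /** Hypothetical signal that the job should fail over instead of hanging. */
    static class JobFailoverException extends RuntimeException {
        JobFailoverException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the send action. An IOException is treated as a connection error and
     * retried up to maxReconnects times; after that the error is escalated so
     * the caller can fail the job rather than leave it stuck.
     */
    static <T> T sendWithReconnect(Callable<T> send, int maxReconnects) {
        IOException last = null;
        for (int attempt = 0; attempt <= maxReconnects; attempt++) {
            try {
                return send.call();
            } catch (IOException e) {
                last = e; // connection error: fall through and try again
            } catch (Exception e) {
                // Anything else is not a connection problem; escalate at once.
                throw new JobFailoverException("Unexpected error while sending", e);
            }
        }
        throw new JobFailoverException(
                "Connection failed after " + maxReconnects + " reconnect attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated send that times out twice and then succeeds.
        String result = sendWithReconnect(() -> {
            if (calls[0]++ < 2) {
                throw new IOException("Connection timed out");
            }
            return "sent";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point of the sketch is that every path ends in either a successful send or an explicit failure, so the caller is never left in the stuck state described above.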



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
