Deng Liwen created FLINK-33406:
----------------------------------
Summary: Flink Job failed due to losing connection from ZK server
Key: FLINK-33406
URL: https://issues.apache.org/jira/browse/FLINK-33406
Project: Flink
Issue Type: Bug
Components: API / DataStream
Affects Versions: 1.14.3
Environment: Flink version: 1.14.3
Zookeeper version: 3.4.10
Reporter: Deng Liwen
We are using Flink 1.14.3 and we faced an issue when losing connection from ZK
server, the flink job connecting to the target ZK server will be failed
directly. This case can be reproduced 100% when you kill the connected ZK
server for simulating connection refused issue. Flink jobs connect to other
running ZK server keep running as expected. The log output is:
{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method
caused an error: java.util.concurrent.TimeoutException at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
at
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:812)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:246)
at
org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1054)
at
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
at java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1731)
at
org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at
org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)Caused by:
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at
org.apache.flink.client.program.StreamContextEnvironment.getJobExecutionResult(StreamContextEnvironment.java:123)
at
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:80)
at
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1916)
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)