Sushant created FLINK-27855:
-------------------------------

             Summary: Job Manager fails to recover with S3 storage and HA enabled
                 Key: FLINK-27855
                 URL: https://issues.apache.org/jira/browse/FLINK-27855
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
            Reporter: Sushant

Steps to replicate
1. Enable HA and specify an S3 path in the Flink configuration
2. Create the Flink deployment and let it run for some time to generate checkpoints
3. Delete the Flink deployment
4. Recreate the deployment; the job manager doesn't come up and complains about cleanup failing against S3

Note that the above steps work fine if AWS EFS is used instead of S3 for HA.

Error Traceback:
{code:java}
2022-05-31 16:39:44,332 WARN org.apache.flink.runtime.dispatcher.cleanup.DefaultResourceCleaner [] - Cleanup of BlobServer failed for job 00000000000000000000000000000000 due to a CompletionException: java.io.IOException: java.io.IOException: Error while cleaning up the BlobStore for job 00000000000000000000000000000000

2022-05-31 16:42:56,955 WARN org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Ignoring JobGraph submission (00000000000000000000000000000000) because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.

2022-05-31 16:42:57,026 ERROR [] - Error while processing events :
org.apache.flink.util.FlinkException: Failed to execute job
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2108) ~[flink-dist-1.15.0.jar:1.15.0]
Caused by: org.apache.flink.runtime.client.DuplicateJobSubmissionException: Job has already been submitted.
    at org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35) ~[flink-dist-1.15.0.jar:1.15.0]

2022-05-31 16:42:57,130 INFO org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application CANCELED:
java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
    at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$6(ApplicationDispatcherBootstrap.java:389) ~[flink-dist-1.15.0.jar:1.15.0]
    at java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) [?:?]
Caused by: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
    at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:71) ~[flink-dist-1.15.0.jar:1.15.0]
    ... 56 more
Caused by: org.apache.flink.runtime.client.JobCancellationException: Job was cancelled.
    at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) ~[flink-dist-1.15.0.jar:1.15.0]
    at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:60) ~[flink-dist-1.15.0.jar:1.15.0]
    ... 56 more
{code}
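
For reference, the setup described in step 1 might look roughly like the fragment below. This is only a sketch and is not taken from the report: it assumes Kubernetes-based HA configured through the {{flinkConfiguration}} section of a FlinkDeployment, and the bucket name and paths are placeholders. The actual configuration used to reproduce the issue may differ.

{code:yaml}
# Sketch only: <your-bucket> and the paths are placeholders, not values from the report.
spec:
  flinkConfiguration:
    # Assumption: Kubernetes HA services; the report only says "HA enabled".
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    # HA metadata (including job graphs and blobs) stored on S3
    high-availability.storageDir: s3://<your-bucket>/flink/ha
    # Checkpoint location, also assumed to be on S3
    state.checkpoints.dir: s3://<your-bucket>/flink/checkpoints
{code}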