[
https://issues.apache.org/jira/browse/FLINK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sushant resolved FLINK-27855.
-----------------------------
Resolution: Fixed
> Job Manager fails to recover with S3 storage and HA enabled
> -----------------------------------------------------------
>
> Key: FLINK-27855
> URL: https://issues.apache.org/jira/browse/FLINK-27855
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Sushant
> Priority: Minor
>
> Flink version: 1.15, native Kubernetes integration, deployed through the Flink Kubernetes Operator:
> https://github.com/apache/flink-kubernetes-operator
> Steps to reproduce:
> 1. Enable HA and set an S3 recovery path in the Flink configuration property
> high-availability.storageDir.
> 2. Create the Flink application deployment and let it run long enough to
> generate checkpoints.
> 3. Delete the Flink application deployment.
> 4. Recreate the deployment. The JobManager pod does not come up and reports
> a failure during S3 recovery cleanup; the error is shown below.
> Note that the same steps succeed when AWS EFS is used instead of S3 for the
> HA storage directory.
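> For reference, step 1 might correspond to Flink configuration entries along
> these lines. This is only a sketch: the bucket name and paths are
> placeholders that were not part of the original report, and the exact values
> used in the failing deployment are unknown.
> {code:yaml}
> # Kubernetes HA services factory (assumed; the report only says HA is enabled)
> high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> # HA recovery metadata on S3; <bucket> is a placeholder
> high-availability.storageDir: s3://<bucket>/flink/recovery
> # Checkpoint location, also a placeholder
> state.checkpoints.dir: s3://<bucket>/flink/checkpoints
> {code}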
> Error Traceback:
> {code:java}
> 2022-05-31 16:39:44,332 WARN
> org.apache.flink.runtime.dispatcher.cleanup.DefaultResourceCleaner [] -
> Cleanup of BlobServer failed for job 00000000000000000000000000000000 due to
> a CompletionException: java.io.IOException: java.io.IOException: Error while
> cleaning up the BlobStore for job 00000000000000000000000000000000
> 2022-05-31 16:42:56,955 WARN
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Ignoring
> JobGraph submission (00000000000000000000000000000000) because the job
> already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED)
> in a previous execution.
> 2022-05-31 16:42:57,026 ERROR [] - Error while processing events
> :
> org.apache.flink.util.FlinkException: Failed to execute job
> at
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2108)
> ~[flink-dist-1.15.0.jar:1.15.0]
> Caused by: org.apache.flink.runtime.client.DuplicateJobSubmissionException:
> Job has already been submitted.
> at
> org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35)
> ~[flink-dist-1.15.0.jar:1.15.0]
> 2022-05-31 16:42:57,130 INFO
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap
> [] - Application CANCELED:
> java.util.concurrent.CompletionException:
> org.apache.flink.client.deployment.application.UnsuccessfulExecutionException:
> Application Status: CANCELED
> at
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$6(ApplicationDispatcherBootstrap.java:389)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) [?:?]
> Caused by:
> org.apache.flink.client.deployment.application.UnsuccessfulExecutionException:
> Application Status: CANCELED
> at
> org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:71)
> ~[flink-dist-1.15.0.jar:1.15.0]
> ... 56 more
> Caused by: org.apache.flink.runtime.client.JobCancellationException: Job was
> cancelled.
> at
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:60)
> ~[flink-dist-1.15.0.jar:1.15.0]
> ... 56 more
> {code}