[ 
https://issues.apache.org/jira/browse/FLINK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushant resolved FLINK-27855.
-----------------------------
    Resolution: Fixed

> Job Manager fails to recover with S3 storage and HA enabled
> -----------------------------------------------------------
>
>                 Key: FLINK-27855
>                 URL: https://issues.apache.org/jira/browse/FLINK-27855
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Sushant
>            Priority: Minor
>
> Flink version: 1.15, running in native Kubernetes integration mode through 
> the Kubernetes operator: 
> https://github.com/apache/flink-kubernetes-operator
> Steps to reproduce:
> 1. Enable HA and set an S3 recovery path in the Flink configuration property 
> high-availability.storageDir.
> 2. Create the Flink application deployment and let it run long enough to 
> generate checkpoints.
> 3. Delete the Flink application deployment.
> 4. Recreate the deployment. The Job Manager pod does not come up and 
> complains about S3 recovery cleanup; the error is shown below.
> Note that the same steps succeed when AWS EFS is used instead of S3 for HA 
> storage.
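> A minimal sketch of the configuration that step 1 refers to. The report only 
> names high-availability.storageDir; the choice of the Kubernetes HA services 
> factory is an assumption, and the bucket and path are hypothetical 
> placeholders. With the operator, these keys would typically be placed under 
> spec.flinkConfiguration of the deployment resource.
> {code:yaml}
> # Assumption: Kubernetes-based HA services (not stated in the report)
> high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> # Hypothetical bucket and path; replace with the actual recovery location
> high-availability.storageDir: s3://<your-bucket>/flink/recovery
> {code}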
> Error Traceback:
> {code:java}
> 2022-05-31 16:39:44,332 WARN  
> org.apache.flink.runtime.dispatcher.cleanup.DefaultResourceCleaner [] - 
> Cleanup of BlobServer failed for job 00000000000000000000000000000000 due to 
> a CompletionException: java.io.IOException: java.io.IOException: Error while 
> cleaning up the BlobStore for job 00000000000000000000000000000000
> 2022-05-31 16:42:56,955 WARN  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Ignoring 
> JobGraph submission (00000000000000000000000000000000) because the job 
> already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) 
> in a previous execution.
> 2022-05-31 16:42:57,026 ERROR              [] - Error while processing events 
> :
> org.apache.flink.util.FlinkException: Failed to execute job
>       at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2108)
>  ~[flink-dist-1.15.0.jar:1.15.0]
> Caused by: org.apache.flink.runtime.client.DuplicateJobSubmissionException: 
> Job has already been submitted.
>       at 
> org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35)
>  ~[flink-dist-1.15.0.jar:1.15.0]
> 2022-05-31 16:42:57,130 INFO  
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap 
> [] - Application CANCELED:
> java.util.concurrent.CompletionException: 
> org.apache.flink.client.deployment.application.UnsuccessfulExecutionException:
>  Application Status: CANCELED
>       at 
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$6(ApplicationDispatcherBootstrap.java:389)
>  ~[flink-dist-1.15.0.jar:1.15.0]
>       at java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) [?:?]
> Caused by: 
> org.apache.flink.client.deployment.application.UnsuccessfulExecutionException:
>  Application Status: CANCELED
>       at 
> org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:71)
>  ~[flink-dist-1.15.0.jar:1.15.0]
>       ... 56 more
> Caused by: org.apache.flink.runtime.client.JobCancellationException: Job was 
> cancelled.
>       at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>  ~[flink-dist-1.15.0.jar:1.15.0]
>       at 
> org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:60)
>  ~[flink-dist-1.15.0.jar:1.15.0]
>       ... 56 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
