Sushant created FLINK-27855:
-------------------------------

             Summary: Job Manager fails to recover with S3 storage and HA enabled
                 Key: FLINK-27855
                 URL: https://issues.apache.org/jira/browse/FLINK-27855
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
            Reporter: Sushant


Steps to replicate
1. Enable HA and set an S3 path for the HA storage in the Flink configuration
2. Create the Flink deployment and let it run for some time to generate checkpoints
3. Delete the Flink deployment
4. Recreate the deployment; the job manager does not come up and complains about BlobServer cleanup on S3

Note that the above steps work fine if AWS EFS is used instead of S3 for HA.
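
For reference, step 1 might look roughly like the fragment below. This is a sketch only, not the exact configuration from the failing deployment: the bucket and path placeholders are hypothetical, and it assumes Kubernetes-based HA is configured through the flinkConfiguration section of the FlinkDeployment spec.

{code:yaml}
spec:
  flinkConfiguration:
    # Assumption: Kubernetes HA services are used
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    # Hypothetical placeholders; substitute the real bucket and paths
    high-availability.storageDir: s3://<bucket>/<ha-path>
    state.checkpoints.dir: s3://<bucket>/<checkpoint-path>
{code}

With EFS instead of S3, the same keys would presumably point at file:// paths on the mounted volume.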


Error Traceback: 


{code:java}
2022-05-31 16:39:44,332 WARN  org.apache.flink.runtime.dispatcher.cleanup.DefaultResourceCleaner [] - Cleanup of BlobServer failed for job 00000000000000000000000000000000 due to a CompletionException: java.io.IOException: java.io.IOException: Error while cleaning up the BlobStore for job 00000000000000000000000000000000

2022-05-31 16:42:56,955 WARN  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Ignoring JobGraph submission (00000000000000000000000000000000) because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.
2022-05-31 16:42:57,026 ERROR              [] - Error while processing events :
org.apache.flink.util.FlinkException: Failed to execute job
        at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2108) ~[flink-dist-1.15.0.jar:1.15.0]
Caused by: org.apache.flink.runtime.client.DuplicateJobSubmissionException: Job has already been submitted.
        at org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35) ~[flink-dist-1.15.0.jar:1.15.0]
2022-05-31 16:42:57,130 INFO  org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application CANCELED:
java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
        at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$6(ApplicationDispatcherBootstrap.java:389) ~[flink-dist-1.15.0.jar:1.15.0]
        at java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) [?:?]
Caused by: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
        at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:71) ~[flink-dist-1.15.0.jar:1.15.0]
        ... 56 more
Caused by: org.apache.flink.runtime.client.JobCancellationException: Job was cancelled.
        at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) ~[flink-dist-1.15.0.jar:1.15.0]
        at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:60) ~[flink-dist-1.15.0.jar:1.15.0]
        ... 56 more
{code}




--
This message was sent by Atlassian Jira
(v8.20.7#820007)
