[
https://issues.apache.org/jira/browse/FLINK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sushant updated FLINK-27855:
----------------------------
Description:
Flink version: 1.15
Steps to reproduce:
1. Enable HA and point the HA storage path to an S3 location in the Flink configuration (a minimal configuration sketch follows the note below)
2. Create the Flink deployment and let it run for some time so that checkpoints are generated
3. Delete the Flink deployment
4. Recreate the deployment; the job manager does not come up and keeps complaining about S3 cleanup errors
Note that the above steps go through fine if AWS EFS is used instead of S3 for HA.
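For reference, a minimal sketch of the HA-related settings from step 1, expressed through Flink's Configuration API; the bucket name and paths are placeholders rather than the actual values used here, and with the Kubernetes operator these options would normally be supplied under spec.flinkConfiguration of the FlinkDeployment instead:
{code:java}
import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HighAvailabilityOptions;

public class HaConfigSketch {
    public static void main(String[] args) {
        // Placeholder bucket/paths; only meant to show which options are involved.
        Configuration conf = new Configuration();
        conf.set(HighAvailabilityOptions.HA_MODE, "kubernetes");
        conf.set(HighAvailabilityOptions.HA_STORAGE_PATH, "s3://my-bucket/flink/ha");
        conf.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, "s3://my-bucket/flink/checkpoints");
        System.out.println(conf);
    }
}
{code}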
Error Traceback:
{code:java}
2022-05-31 16:39:44,332 WARN  org.apache.flink.runtime.dispatcher.cleanup.DefaultResourceCleaner [] - Cleanup of BlobServer failed for job 00000000000000000000000000000000 due to a CompletionException: java.io.IOException: java.io.IOException: Error while cleaning up the BlobStore for job 00000000000000000000000000000000
2022-05-31 16:42:56,955 WARN  org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Ignoring JobGraph submission (00000000000000000000000000000000) because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.
2022-05-31 16:42:57,026 ERROR [] - Error while processing events :
org.apache.flink.util.FlinkException: Failed to execute job
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2108) ~[flink-dist-1.15.0.jar:1.15.0]
Caused by: org.apache.flink.runtime.client.DuplicateJobSubmissionException: Job has already been submitted.
    at org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35) ~[flink-dist-1.15.0.jar:1.15.0]
2022-05-31 16:42:57,130 INFO  org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application CANCELED:
java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
    at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$6(ApplicationDispatcherBootstrap.java:389) ~[flink-dist-1.15.0.jar:1.15.0]
    at java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) [?:?]
Caused by: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
    at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:71) ~[flink-dist-1.15.0.jar:1.15.0]
    ... 56 more
Caused by: org.apache.flink.runtime.client.JobCancellationException: Job was cancelled.
    at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) ~[flink-dist-1.15.0.jar:1.15.0]
    at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:60) ~[flink-dist-1.15.0.jar:1.15.0]
    ... 56 more
{code}
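For context only, a hedged sketch of how the resubmission failure in the trace above surfaces on the client side; the pipeline is a placeholder, and the catch block merely illustrates inspecting the cause chain for the DuplicateJobSubmissionException shown in the log:
{code:java}
import org.apache.flink.runtime.client.DuplicateJobSubmissionException;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.ExceptionUtils;

public class ResubmissionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print(); // placeholder pipeline

        try {
            env.executeAsync("resubmitted-job");
        } catch (Exception e) {
            // When the dispatcher still considers the job globally terminal
            // (as in the log above), the cause chain contains
            // DuplicateJobSubmissionException.
            if (ExceptionUtils.findThrowable(e, DuplicateJobSubmissionException.class).isPresent()) {
                System.err.println("Submission rejected: job already reached a terminal state");
            } else {
                throw e;
            }
        }
    }
}
{code}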
> Job Manager fails to recover with S3 storage and HA enabled
> -----------------------------------------------------------
>
> Key: FLINK-27855
> URL: https://issues.apache.org/jira/browse/FLINK-27855
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Sushant
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)