[
https://issues.apache.org/jira/browse/FLINK-26772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mika Naylor updated FLINK-26772:
--------------------------------
Priority: Critical (was: Blocker)
> HA Application Mode does not retry resource cleanup
> ---------------------------------------------------
>
> Key: FLINK-26772
> URL: https://issues.apache.org/jira/browse/FLINK-26772
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Mika Naylor
> Priority: Critical
> Fix For: 1.15.0
>
> Attachments: testcluster-599f4d476b-bghw5_log.txt
>
>
> I set up a scenario in which a k8s native cluster running in Application Mode
> used an s3 bucket for its high availability storage directory, with the
> hadoop plugin. The credentials the cluster used give it permission to write
> to the bucket, but not to delete, so cleaning up the blob/jobgraph will fail.
> I expected that when trying to clean up the HA resources, it would attempt to
> retry the cleanup. I even configured this explicitly:
> {{cleanup-strategy: fixed-delay}}
> {{cleanup-strategy.fixed-delay.attempts: 100}}
> {{cleanup-strategy.fixed-delay.delay: 10 s}}
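> For reference, this is the equivalent flink-conf.yaml fragment I would expect
> to enable retried cleanup (a sketch, assuming the Flink 1.15 key names):
> {code:yaml}
> # Retry failed HA resource cleanup with a fixed delay between attempts
> cleanup-strategy: fixed-delay
> cleanup-strategy.fixed-delay.attempts: 100
> cleanup-strategy.fixed-delay.delay: 10 s
> {code}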
> However, the behaviour I observed is that the blob and jobgraph cleanup is
> attempted only once. After this failure, I observe the following in the logs:
> {{2022-03-21 09:34:40,634 INFO
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap
> [] - Application completed SUCCESSFULLY}}
> {{2022-03-21 09:34:40,635 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting
> KubernetesApplicationClusterEntrypoint down with application status
> SUCCEEDED. Diagnostics null.}}
> After which, the cluster receives a SIGTERM and exits.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)