[ https://issues.apache.org/jira/browse/FLINK-26772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mika Naylor updated FLINK-26772:
--------------------------------
Description:
I set up a scenario in which a k8s native cluster running in Application Mode used an s3 bucket for its high availability storage directory, with the hadoop plugin. The credentials the cluster used give it permission to write to the bucket, but not to delete, so cleaning up the blob/jobgraph will fail.

I expected that when trying to clean up the HA resources, it would attempt to retry the cleanup. I even configured this explicitly:

{{cleanup-strategy: fixed-delay}}
{{cleanup-strategy.fixed-delay.attempts: 100}}
{{cleanup-strategy.fixed-delay.delay: 10 s}}

However, the behaviour I observed is that the blob and jobgraph cleanup is only attempted once. After this failure, I observe in the logs that:

{{2022-03-21 09:34:40,634 INFO org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application completed SUCCESSFULLY}}
{{2022-03-21 09:34:40,635 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting KubernetesApplicationClusterEntrypoint down with application status SUCCEEDED. Diagnostics null.}}

After which, the cluster receives a SIGTERM and exits.

  was:
I set up a scenario in which a k8s native cluster running in Application Mode used an s3 bucket for its high availability storage directory, with the hadoop plugin. The credentials the cluster used give it permission to write to the bucket, but not to delete, so cleaning up the blob/jobgraph will fail.

I expected that when trying to clean up the HA resources, it would attempt to retry the cleanup. I even configured this explicitly:

{{cleanup-strategy: fixed-delay
cleanup-strategy.fixed-delay.attempts: 100
cleanup-strategy.fixed-delay.delay: 10 s}}

However, the behaviour I observed is that the blob and jobgraph cleanup is only attempted once.
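For reference, the retry settings above would look like the following fragment in flink-conf.yaml (a sketch based on the Flink 1.15 retryable-cleanup options quoted in this issue; with these values a failed HA cleanup is expected to be retried every 10 s, up to 100 times):

```yaml
# flink-conf.yaml (sketch) -- retryable cleanup configuration, Flink 1.15
# Use the fixed-delay retry strategy for job cleanup.
cleanup-strategy: fixed-delay
# Retry a failed cleanup up to 100 times...
cleanup-strategy.fixed-delay.attempts: 100
# ...waiting 10 seconds between attempts.
cleanup-strategy.fixed-delay.delay: 10 s
```

The reported behaviour (a single cleanup attempt followed by shutdown) contradicts what this configuration requests.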
After this failure, I observe in the logs that:

{{2022-03-21 09:34:40,634 INFO org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application completed SUCCESSFULLY
2022-03-21 09:34:40,635 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting KubernetesApplicationClusterEntrypoint down with application status SUCCEEDED. Diagnostics null.}}

After which, the cluster receives a SIGTERM and exits.

> Kubernetes Native in HA Application Mode does not retry resource cleanup
> ------------------------------------------------------------------------
>
>                 Key: FLINK-26772
>                 URL: https://issues.apache.org/jira/browse/FLINK-26772
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Mika Naylor
>            Priority: Blocker
>
> I set up a scenario in which a k8s native cluster running in Application Mode
> used an s3 bucket for its high availability storage directory, with the
> hadoop plugin. The credentials the cluster used give it permission to write
> to the bucket, but not to delete, so cleaning up the blob/jobgraph will fail.
> I expected that when trying to clean up the HA resources, it would attempt to
> retry the cleanup. I even configured this explicitly:
> {{cleanup-strategy: fixed-delay}}
> {{cleanup-strategy.fixed-delay.attempts: 100}}
> {{cleanup-strategy.fixed-delay.delay: 10 s}}
> However, the behaviour I observed is that the blob and jobgraph cleanup is
> only attempted once. After this failure, I observe in the logs that:
> {{2022-03-21 09:34:40,634 INFO
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap
> [] - Application completed SUCCESSFULLY}}
> {{2022-03-21 09:34:40,635 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting
> KubernetesApplicationClusterEntrypoint down with application status
> SUCCEEDED. Diagnostics null.}}
> After which, the cluster receives a SIGTERM and exits.
-- This message was sent by Atlassian Jira (v8.20.1#820001)