[
https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719355#comment-17719355
]
Thomas Weise commented on FLINK-31974:
--------------------------------------
Many errors are transient. This specific case is a clear example: resource
availability on a large cluster changes constantly, so a pod that cannot be
scheduled now may be schedulable a few seconds later. Other k8s-related issues
can also be transient. For example, a request that fails due to rate limiting
will likely succeed soon after, and we would actually make things worse by
simply letting the job fail instead of following a backoff/retry strategy. I'm
also leaning towards a retry-by-default strategy, where we identify the
specific cases that should be fatal errors.
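
To make the retry-by-default idea concrete, here is a minimal sketch of a
backoff/retry helper. It is not Flink code: the class name RetryByDefault, the
method callWithRetry, and the isFatal predicate are all hypothetical names
chosen for illustration. Every failure is treated as transient unless the
caller-supplied predicate marks it as fatal.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper, not part of Flink or the fabric8 client.
final class RetryByDefault {
    private RetryByDefault() {}

    /**
     * Runs the action, retrying with exponential backoff on any exception
     * that the isFatal predicate does not classify as fatal.
     */
    static <T> T callWithRetry(
            Callable<T> action,
            Predicate<Exception> isFatal,
            int maxAttempts,
            long initialDelayMs,
            long maxDelayMs) throws Exception {
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Never retry an interrupt; restore the flag and propagate.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // Give up only for known-fatal errors or when attempts run out.
                if (isFatal.test(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delayMs);
                // Double the delay for the next attempt, capped at maxDelayMs.
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
    }
}
```

With a helper along these lines, a pod creation request rejected because of an
exceeded quota or rate limiting would be retried after a delay, and only the
explicitly listed cases would fail right away.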
> JobManager crashes after KubernetesClientException exception with
> FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
> Key: FLINK-31974
> URL: https://issues.apache.org/jira/browse/FLINK-31974
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.17.0
> Reporter: Sergio Sainz
> Assignee: Weijie Guo
> Priority: Major
>
> When the resource quota limit is reached, the JobManager throws:
>
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>
> In {*}1.16.1, this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Failed requesting worker with resource spec WorkerResourceSpec
> {cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0
> bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb
> (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException:
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods.
> Message: Forbidden!Configured service account doesn't have access. Service
> account may have been revoked. pods
> "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota:
> my-namespace-resource-quota, requested: limits.cpu=3, used:
> limits.cpu=12100m, limited: limits.cpu=13.
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.16.1.jar:1.16.1]
> ... 4 more
> {code}
> But {*}in Flink 1.17.0, the JobManager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler
> [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15'
> produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.17.0.jar:1.17.0]
> ... 4 more
> {code}