[
https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719275#comment-17719275
]
Márton Balassi commented on FLINK-31974:
----------------------------------------
In this specific case I much prefer the behaviour exhibited by 1.16.1. Resource
quota availability changes dynamically, so if the JobManager kept retrying
(ideally with a backoff), it is reasonable to expect that it would eventually
succeed in most real-world scenarios. Adding some guardrails around this
(fail if a minimum parallelism is not satisfied, fail if a maximum timeout is
reached, etc.) to avoid ending up with many small jobs competing for
insufficient resources and wasting capacity would be acceptable to me, but
failing outright on the first try is more a bug than a feature, imho. :)
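
To make the retry-with-guardrails idea concrete, here is a minimal sketch. It
is not Flink's ResourceManager code: the class WorkerRequestRetry, the Outcome
enum, and every parameter name are hypothetical, and the policy on timeout
(run degraded if the minimum parallelism is met, otherwise fail) is only one
possible reading of the guardrails described above.

```java
import java.time.Duration;
import java.util.function.IntSupplier;
import java.util.function.LongConsumer;

/** Hypothetical sketch of retrying a worker request; not Flink code. */
public final class WorkerRequestRetry {

    /** Result of the retry loop. */
    public enum Outcome { SATISFIED, DEGRADED, FAILED }

    private WorkerRequestRetry() {}

    /** Delay for a 0-based attempt: initial * 2^attempt, capped at max. */
    public static Duration backoff(int attempt, Duration initial, Duration max) {
        long init = initial.toMillis();
        long cap = max.toMillis();
        long factor = 1L << Math.min(attempt, 30);
        // Compare before multiplying so the product can never overflow.
        return Duration.ofMillis(init > cap / factor ? cap : init * factor);
    }

    /**
     * Polls grantedSlots until it reports at least `desired` slots or the
     * time budget is used up. On timeout the job runs degraded if at least
     * `minimum` slots were granted on the last attempt, and fails otherwise.
     * The sleeper receives each delay in milliseconds (e.g. Thread.sleep).
     */
    public static Outcome requestWithRetry(
            IntSupplier grantedSlots, int desired, int minimum,
            Duration initial, Duration maxDelay, Duration timeout,
            LongConsumer sleeper) {
        if (initial.isZero() || initial.isNegative()) {
            throw new IllegalArgumentException("initial delay must be positive");
        }
        long elapsedMillis = 0;
        int granted;
        for (int attempt = 0; ; attempt++) {
            granted = grantedSlots.getAsInt();
            if (granted >= desired) {
                return Outcome.SATISFIED;
            }
            long delay = backoff(attempt, initial, maxDelay).toMillis();
            if (elapsedMillis + delay > timeout.toMillis()) {
                break; // guardrail: the next wait would exceed the max timeout
            }
            sleeper.accept(delay);
            elapsedMillis += delay;
        }
        // Guardrail: below the minimum parallelism we fail instead of running.
        return granted >= minimum ? Outcome.DEGRADED : Outcome.FAILED;
    }
}
```

Passing the sleeper as a parameter keeps the loop testable without real
waiting; a caller that wants to block would wrap Thread.sleep and handle
InterruptedException there.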
> JobManager crashes after KubernetesClientException exception with
> FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
> Key: FLINK-31974
> URL: https://issues.apache.org/jira/browse/FLINK-31974
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.17.0
> Reporter: Sergio Sainz
> Assignee: Weijie Guo
> Priority: Major
>
> When the resource quota limit is reached, the JobManager throws:
>
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>
> In {*}1.16.1, this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Failed requesting worker with resource spec WorkerResourceSpec
> {cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0
> bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb
> (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException:
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods.
> Message: Forbidden!Configured service account doesn't have access. Service
> account may have been revoked. pods
> "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota:
> my-namespace-resource-quota, requested: limits.cpu=3, used:
> limits.cpu=12100m, limited: limits.cpu=13.
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.16.1.jar:1.16.1]
> ... 4 more
> {code}
> But {*}in Flink 1.17.0, the JobManager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler
> [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15'
> produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.17.0.jar:1.17.0]
> ... 4 more
> {code}