[
https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718964#comment-17718964
]
Matthias Pohl commented on FLINK-31974:
---------------------------------------
I'm still unsure what the desired behavior is in this case. The k8s cluster
doesn't provide enough resources for the Flink cluster to run the submitted
job at its requested parallelism. In my opinion, the fatal error is correct.
[~xtsong], what's your take on this?
> JobManager crashes after KubernetesClientException exception with
> FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
> Key: FLINK-31974
> URL: https://issues.apache.org/jira/browse/FLINK-31974
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.17.0
> Reporter: Sergio Sainz
> Priority: Major
>
> When the resource quota limit is reached, the JobManager throws:
>
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>
> In {*}1.16.1, this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Failed requesting worker with resource spec WorkerResourceSpec
> {cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0
> bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb
> (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException:
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods.
> Message: Forbidden!Configured service account doesn't have access. Service
> account may have been revoked. pods
> "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota:
> my-namespace-resource-quota, requested: limits.cpu=3, used:
> limits.cpu=12100m, limited: limits.cpu=13.
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.16.1.jar:1.16.1]
> ... 4 more
> {code}
> But {*}in Flink 1.17.0, the JobManager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler
> [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15'
> produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.17.0.jar:1.17.0]
> ... 4 more
> {code}
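
To make the difference between the two logs concrete, here is a minimal Java sketch of the two ways a failed asynchronous worker request can be treated. It is not Flink's actual code: WorkerRequestSketch, QuotaExceededException, requestWorker, requestGracefully and requestAndPropagate are hypothetical names invented for illustration, and the quota rejection is simulated rather than coming from a real Kubernetes client. Only the java.util.concurrent calls are real APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class WorkerRequestSketch {

    /** Hypothetical stand-in for the quota rejection reported by the API server. */
    static class QuotaExceededException extends RuntimeException {
        QuotaExceededException(String message) {
            super(message);
        }
    }

    /** Simulates an asynchronous pod creation that the resource quota rejects. */
    static CompletableFuture<String> requestWorker(String podName) {
        CompletableFuture<String> result = new CompletableFuture<>();
        result.completeExceptionally(
                new QuotaExceededException("pods \"" + podName + "\" is forbidden: exceeded quota"));
        return result;
    }

    /** Strips a CompletionException wrapper if one is present. */
    static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }

    /**
     * Graceful variant: the failure is consumed inside the callback and logged,
     * so nothing escapes to the thread's uncaught-exception handler.
     * Returns true if the worker was requested, false if the request failed.
     */
    static CompletableFuture<Boolean> requestGracefully(String podName) {
        return requestWorker(podName).handle((pod, error) -> {
            if (error != null) {
                System.err.println("WARN Failed requesting worker: " + unwrap(error).getMessage());
                return false;
            }
            return true;
        });
    }

    /**
     * Propagating variant: join() rethrows the failure as a CompletionException.
     * If no caller catches it, it reaches the thread's uncaught-exception handler.
     */
    static String requestAndPropagate(String podName) {
        return requestWorker(podName).join();
    }

    public static void main(String[] args) {
        boolean started = requestGracefully("taskmanager-1-2").join();
        System.out.println("graceful variant, worker started: " + started);

        try {
            requestAndPropagate("taskmanager-1-2");
        } catch (CompletionException e) {
            System.out.println("propagating variant threw: " + e.getCause().getMessage());
        }
    }
}
```

In the graceful variant the process keeps running and could retry the request later; in the propagating variant, a handler that stops the process on any uncaught exception would turn the same quota rejection into a crash. Which of the two is the desired behavior is exactly the open question in this ticket.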