[
https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719727#comment-17719727
]
Matthias Pohl commented on FLINK-31974:
---------------------------------------
[~sergiosp] I guess it's no longer necessary to provide the logs. The problem
is understood and the discussion has already moved on.
On the discussion about how to handle errors in this part of the code: tbh,
initially I leaned towards [~xtsong]'s proposal to make the error handling as
strict as possible through a whitelist and to avoid adding yet another
configuration parameter, the idea being that Flink's deployment environment
should be in a healthy state without any misconfiguration. But as the
discussion moved on, I came to acknowledge that this is too strict in quite a
few scenarios. I also get [~gyfora]'s point that we're not that restrictive
in other places of the code base, either.
One concern I have with the error whitelisting, though, is that the error
classification could become "complex". The error [~sergiosp] shared was about
hitting quota limits. The error type we're seeing is a Forbidden error
(unfortunately, the error code is not logged, but I would assume 403,
analogous to the HTTP status code). I could imagine this error type also being
returned in other cases (e.g. when the wrong service account is used). The
former error is something we want to retry in certain scenarios, but the
latter one (based on my understanding) could be considered a general
infrastructure issue and, as a consequence, be treated as a fatal error.
It looks like identifying the type of error would require parsing the error
message. How confident are we about the stability of those error messages? It
looks like they are derived from the k8s HTTP responses and, therefore, might
be stable across different Kubernetes versions. But generally, relying on
error messages to derive Flink's behavior doesn't feel right. Is this a valid
concern? For this reason, I started to favor what was proposed by [~gyfora] in
the discussion.
I might be wrong here because I'm not that familiar with the k8s API. I wanted
to share this, anyway.
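To make the classification concern above concrete, here is a minimal sketch of what a whitelist-style check might look like. It is not Flink code: the class name, the Action enum, and the "exceeded quota" substring check are assumptions for illustration, and the 403 status code is only assumed, since the code is not logged. The second sample message (a missing-permission case) is likewise illustrative. As far as I know, the fabric8 KubernetesClientException exposes the status code via getCode(); the sketch takes a plain int and String instead so that it stays self-contained.

```java
import java.util.Locale;

/** Hypothetical sketch: separating two "Forbidden" failures that share one status code. */
public class PodCreationErrorClassifier {

    /** Hypothetical outcome categories; not part of Flink. */
    public enum Action { RETRY, FATAL, UNCLASSIFIED }

    /**
     * Classifies a failed pod creation request.
     *
     * @param httpCode status code of the failed request (assumed to be 403 for Forbidden)
     * @param message  error message as it appears in the exception
     */
    public static Action classify(int httpCode, String message) {
        if (httpCode != 403) {
            // This sketch only looks at Forbidden errors.
            return Action.UNCLASSIFIED;
        }
        String normalized = message == null ? "" : message.toLowerCase(Locale.ROOT);
        // The status code alone cannot tell the two cases apart, so the decision
        // falls back to message parsing. This is the fragile part: it breaks as
        // soon as the wording of the message changes.
        if (normalized.contains("exceeded quota")) {
            return Action.RETRY;
        }
        // Any other Forbidden error (e.g. a wrong service account) is treated as fatal.
        return Action.FATAL;
    }

    public static void main(String[] args) {
        // Message fragment taken from the error reported in this issue.
        String quota = "pods \"my-namespace-flink-cluster-taskmanager-1-2\" is forbidden: "
                + "exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, "
                + "used: limits.cpu=12100m, limited: limits.cpu=13.";
        // Illustrative message for a missing-permission case (assumed wording).
        String rbac = "pods is forbidden: User \"system:serviceaccount:my-namespace:default\" "
                + "cannot create resource \"pods\"";
        System.out.println(classify(403, quota));
        System.out.println(classify(403, rbac));
    }
}
```

Both sample inputs carry the same status code, so the only thing separating the retryable case from the fatal one is a substring of the message. That is the dependency on message stability questioned above.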
> JobManager crashes after KubernetesClientException exception with
> FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
> Key: FLINK-31974
> URL: https://issues.apache.org/jira/browse/FLINK-31974
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.17.0
> Reporter: Sergio Sainz
> Assignee: Weijie Guo
> Priority: Major
>
> When the resource quota limit is reached, the JobManager throws
>
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>
> In {*}1.16.1, this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Failed requesting worker with resource spec WorkerResourceSpec
> \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0
> bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb
> (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException:
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods.
> Message: Forbidden!Configured service account doesn't have access. Service
> account may have been revoked. pods
> "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota:
> my-namespace-resource-quota, requested: limits.cpu=3, used:
> limits.cpu=12100m, limited: limits.cpu=13.
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.16.1.jar:1.16.1]
> ... 4 more
> {code}
> But {*}in Flink 1.17.0, the JobManager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler
> [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15'
> produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.17.0.jar:1.17.0]
> ... 4 more
> {code}