[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler

Matthias Pohl (Jira) Fri, 05 May 2023 01:01:09 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719727#comment-17719727
 ]


Matthias Pohl commented on FLINK-31974:
---------------------------------------

[~sergiosp] I guess it's not necessary to provide the logs anymore. The problem 
is understood and the discussion went on, already.

On the discussion about how to handle errors in this part of the code: tbh, 
initially I leaned towards [~xtsong]'s proposal where he suggested to make the 
error handling as strict as possible through a whitelist and avoid adding yet 
another configuration parameter with the idea in mind that Flink's deployment 
environment should be in a healthy state without any mis-configuration. But as 
the discussion moved on, I started to acknowledge that it's too strict in quite 
a few scenarios. I also get [~gyfora]'s point that we're not that restrictive 
in other places of the code base, either.

One concern I have with the error whitelisting, though, is that the error 
classification could become "complex". The error [~sergiosp] shared was about 
hitting quota limits. The error type we're seeing is a Forbidden error 
(unfortunately, without the error code being logged but I would assume 403 
analogously to the HTTP error code). I could imagine this error type also being 
returned in other cases (e.g. wrong service account being used). The former 
error is something we want to retry in certain scenarios but the latter one 
(based on my understanding) would be one that could be considered a general 
infrastructure issue and, as a consequence, could be treated as a fatal error. 
It looks like it would require error message parsing to identify the type of 
error. How confident are we about the stability of those error messages? It 
looks like they are derived from the k8s HTTP responses and, therefore, might 
be stable among different Kubernetes versions. But generally, relying on error 
messages for deriving Flink's behavior feels not right. Is this a valid 
concern? In this sense, I started to favor what was proposed by [~gyfora] in 
the discussion.

I might be wrong here because I'm not that familiar with the k8s API. I wanted 
to share this, anyway.

> JobManager crashes after KubernetesClientException exception with 
> FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-31974
>                 URL: https://issues.apache.org/jira/browse/FLINK-31974
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.0
>            Reporter: Sergio Sainz
>            Assignee: Weijie Guo
>            Priority: Major
>
> When resource quota limit is reached JobManager will throw
>  
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: POST at: 
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>  
> In {*}1.16.1 , this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
> Failed requesting worker with resource spec WorkerResourceSpec 
> \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 
> bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb 
> (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException: 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>         at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture.completeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
> ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
> ~[?:?]
>         at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. 
> Message: Forbidden!Configured service account doesn't have access. Service 
> account may have been revoked. pods 
> "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: 
> my-namespace-resource-quota, requested: limits.cpu=3, used: 
> limits.cpu=12100m, limited: limits.cpu=13.
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         ... 4 more
> {code}
> But , {*}in Flink 1.17.0 , Job Manager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler 
>              [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' 
> produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException: 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: POST at: 
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>         at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture.completeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
> ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
> ~[?:?]
>         at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: POST at: 
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         ... 4 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler

Reply via email to