[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719727#comment-17719727 ]
Matthias Pohl commented on FLINK-31974:
---------------------------------------

[~sergiosp] I guess it's no longer necessary to provide the logs. The problem is understood and the discussion has already moved on.

On the discussion about how to handle errors in this part of the code: tbh, I initially leaned towards [~xtsong]'s proposal to make the error handling as strict as possible through a whitelist and to avoid adding yet another configuration parameter, the idea being that Flink's deployment environment should be in a healthy state without any misconfiguration. But as the discussion moved on, I came to acknowledge that this is too strict in quite a few scenarios. I also get [~gyfora]'s point that we're not that restrictive in other places of the code base, either.

One concern I have with the error whitelisting, though, is that the error classification could become "complex". The error [~sergiosp] shared was about hitting quota limits. The error type we're seeing is a Forbidden error (unfortunately, the error code isn't logged, but I would assume 403, analogous to the HTTP status code). I could imagine this error type also being returned in other cases (e.g. when the wrong service account is used). The former is something we want to retry in certain scenarios, but the latter (based on my understanding) could be considered a general infrastructure issue and, as a consequence, be treated as a fatal error.

It looks like identifying the type of error would require parsing the error message. How confident are we about the stability of those error messages? They appear to be derived from the k8s HTTP responses and, therefore, might be stable across Kubernetes versions. But generally, relying on error messages to derive Flink's behavior doesn't feel right. Is this a valid concern? In this sense, I started to favor what was proposed by [~gyfora] in the discussion.
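To make the classification concern concrete, here is a minimal Java sketch of what a whitelist that distinguishes the two Forbidden cases might have to do. It is not Flink or fabric8 code: the class `ForbiddenErrorClassifier`, the `Handling` enum, and the `classify` method are hypothetical names, the 403 status is the assumption stated above (the code was not logged), and the second sample message in `main` is an illustrative guess at a permission denial rather than something taken from this issue. Note that the quota error reported here also carries the "Configured service account doesn't have access" prefix, so the only distinguishing signal in this sketch is the "exceeded quota" substring.

```java
import java.util.Locale;

/** Hypothetical sketch only; neither Flink nor the fabric8 client contains this class. */
final class ForbiddenErrorClassifier {

    /** How the caller would react to a failed pod creation. */
    enum Handling { RETRY, FATAL }

    /** Assumed status code for Forbidden; the reported log did not include it. */
    static final int HTTP_FORBIDDEN = 403;

    /**
     * Classifies a failed request by status code and message text.
     * The substring match is the fragile part: it depends on the exact
     * wording of the error message returned by Kubernetes.
     */
    static Handling classify(int httpCode, String message) {
        if (httpCode != HTTP_FORBIDDEN) {
            // Other codes are out of scope for this sketch; stay strict.
            return Handling.FATAL;
        }
        String normalized = message == null ? "" : message.toLowerCase(Locale.ROOT);
        if (normalized.contains("exceeded quota")) {
            // Quota may free up later, so a retry can make sense.
            return Handling.RETRY;
        }
        // Any other Forbidden (e.g. wrong service account) is treated as an infrastructure issue.
        return Handling.FATAL;
    }

    public static void main(String[] args) {
        // Message shape taken from the error reported in this issue.
        String quota = "Forbidden!Configured service account doesn't have access. "
                + "Service account may have been revoked. pods \"tm-1-2\" is forbidden: "
                + "exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3";
        // Illustrative assumption of a permission denial; not taken from this issue.
        String denied = "Forbidden!Configured service account doesn't have access. "
                + "Service account may have been revoked. pods is forbidden: "
                + "User cannot create resource \"pods\"";
        System.out.println(classify(HTTP_FORBIDDEN, quota));
        System.out.println(classify(HTTP_FORBIDDEN, denied));
    }
}
```

Both sample messages share the same status code and the same prefix, which is why the sketch has to fall back on message text. A change in that wording between Kubernetes or client versions would silently flip the quota case to fatal.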
I might be wrong here because I'm not that familiar with the k8s API. I wanted to share this, anyway.

> JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-31974
>                 URL: https://issues.apache.org/jira/browse/FLINK-31974
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.0
>            Reporter: Sergio Sainz
>            Assignee: Weijie Guo
>            Priority: Major
>
> When resource quota limit is reached JobManager will throw
>
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>
> In {*}1.16.1, this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Failed requesting worker with resource spec WorkerResourceSpec \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
>     at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.16.1.jar:1.16.1]
>     at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.16.1.jar:1.16.1]
>     ... 4 more
> {code}
> But, {*}in Flink 1.17.0, Job Manager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
>     at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.17.0.jar:1.17.0]
>     ... 4 more
> {code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)