[
https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719331#comment-17719331
]
Xintong Song commented on FLINK-31974:
--------------------------------------
[~gyfora],
bq. Flink treats only very few errors fatal. IO errors, connector
(source/sink) errors etc all cause job restarts and in many cases "Flink
cannot recover from by itself". You actually expect the error to be temporary
and hopefully not get it after the restart. So I think it would be generally
inconsistent with the current error handling behaviour if resource manager
errors would simply let the job die fatally and not retry in the same way.
I think the difference is that IO errors and connector errors affect the job
but not the Flink cluster / deployment. In a session cluster, we should not
fail the cluster because of an error from a single job. The resource manager's
interaction with the Kubernetes API server, however, is cluster-level behavior,
and conceptually we don't distinguish resources for individual jobs until the
slots are allocated. Moreover, multiple jobs may share the same resource (pod).
One could argue that in application mode the cluster / deployment is equivalent
to the job. However, the resource manager is unaware of the cluster mode
(session / application).
bq. Flink jobs/clusters should be resilient and keep retrying in case of
errors and should not give up especially for streaming workloads.
This is different from the feedback that I get from our production users. But
I can understand if that's what some users want. So I guess it may be worth a
configuration option, as you suggested.
[~mbalassi],
+1 to what you said about the specific case. I think there's a consensus that
reaching the quota limit should not be treated as a fatal error.
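To make the two points above concrete, here is a minimal sketch of how a
failed worker request could be classified. It is not Flink's actual code:
the class name, the Action enum and the retryOnUnknownErrors flag (standing in
for the configuration option discussed above) are all hypothetical. It also
assumes that a quota rejection can be recognized by HTTP status 403 together
with the "exceeded quota" text seen in the reported exception message.

```java
import java.util.Locale;

/**
 * Hypothetical sketch, not part of Flink: decides whether a failed
 * worker (pod) request should be retried or treated as fatal.
 */
final class WorkerRequestFailureClassifier {

    /** Possible outcomes for a failed worker request. */
    enum Action { RETRY, FAIL_FATALLY }

    // Hypothetical stand-in for a user-facing configuration option:
    // if true, errors we cannot classify are retried instead of failing.
    private final boolean retryOnUnknownErrors;

    WorkerRequestFailureClassifier(boolean retryOnUnknownErrors) {
        this.retryOnUnknownErrors = retryOnUnknownErrors;
    }

    /**
     * @param httpStatus status code returned by the Kubernetes API server
     * @param message    error message carried by the client exception
     */
    Action classify(int httpStatus, String message) {
        // Reaching the quota limit is never fatal: capacity may free up later.
        if (isQuotaExceeded(httpStatus, message)) {
            return Action.RETRY;
        }
        // Everything else follows the configured policy.
        return retryOnUnknownErrors ? Action.RETRY : Action.FAIL_FATALLY;
    }

    // Assumption: quota rejections arrive as 403 with "exceeded quota" in
    // the message, as in the log excerpt of this issue.
    static boolean isQuotaExceeded(int httpStatus, String message) {
        return httpStatus == 403
                && message != null
                && message.toLowerCase(Locale.ROOT).contains("exceeded quota");
    }
}
```

With retryOnUnknownErrors set to false, only the quota case is retried and
other API server errors remain fatal; setting it to true gives the "keep
retrying" behavior requested for streaming workloads.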
> JobManager crashes after KubernetesClientException exception with
> FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
> Key: FLINK-31974
> URL: https://issues.apache.org/jira/browse/FLINK-31974
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.17.0
> Reporter: Sergio Sainz
> Assignee: Weijie Guo
> Priority: Major
>
> When resource quota limit is reached JobManager will throw
>
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>
> In {*}1.16.1, this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Failed requesting worker with resource spec WorkerResourceSpec
> {cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0
> bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb
> (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException:
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods.
> Message: Forbidden!Configured service account doesn't have access. Service
> account may have been revoked. pods
> "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota:
> my-namespace-resource-quota, requested: limits.cpu=3, used:
> limits.cpu=12100m, limited: limits.cpu=13.
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.16.1.jar:1.16.1]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.16.1.jar:1.16.1]
> ... 4 more
> {code}
> But {*}in Flink 1.17.0, the JobManager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler
> [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15'
> produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture.completeThrowable(Unknown
> Source) ~[?:?]
> at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown
> Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by:
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message:
> Forbidden!Configured service account doesn't have access. Service account may
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is
> forbidden: exceeded quota: my-namespace-resource-quota, requested:
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
> ~[flink-dist-1.17.0.jar:1.17.0]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
> ~[flink-dist-1.17.0.jar:1.17.0]
> ... 4 more
> {code}