[ 
https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719247#comment-17719247
 ] 

Gyula Fora commented on FLINK-31974:
------------------------------------

Flink treats only very few errors fatal. IO errors, connector (source/sink ) 
errors etc all cause job restarts and in many cases "Flink cannot recover from 
by itself". You actually expect the error to be temporary and hopefully not get 
it after the restart. So I think it would be generally inconsistent with the 
current error handling behaviour if resource manager errors would simply let 
the job die fatally and not retry in the same way.

So I am mostly looking at this from the user perspective. Flink jobs/clusters 
should be resilient and keep retrying in case of errors and should not give up 
especially for streaming workloads. This is how it works now and this what most 
users expect I think.

> JobManager crashes after KubernetesClientException exception with 
> FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-31974
>                 URL: https://issues.apache.org/jira/browse/FLINK-31974
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.0
>            Reporter: Sergio Sainz
>            Assignee: Weijie Guo
>            Priority: Major
>
> When resource quota limit is reached JobManager will throw
>  
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: POST at: 
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>  
> In {*}1.16.1 , this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
> Failed requesting worker with resource spec WorkerResourceSpec 
> \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 
> bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb 
> (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException: 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>         at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture.completeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
> ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
> ~[?:?]
>         at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. 
> Message: Forbidden!Configured service account doesn't have access. Service 
> account may have been revoked. pods 
> "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: 
> my-namespace-resource-quota, requested: limits.cpu=3, used: 
> limits.cpu=12100m, limited: limits.cpu=13.
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         ... 4 more
> {code}
> But , {*}in Flink 1.17.0 , Job Manager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler 
>              [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' 
> produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException: 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: POST at: 
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>         at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture.completeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
> ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
> ~[?:?]
>         at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: POST at: 
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         ... 4 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to