[jira] [Commented] (FLINK-31974) JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler

Xintong Song (Jira) Wed, 03 May 2023 19:04:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719116#comment-17719116
 ]


Xintong Song commented on FLINK-31974:
--------------------------------------

Thanks [~sergiosp] for reporting, and thanks [~mapohl] for looking into this.

bq. The k8s cluster doesn't provide the resources so that the Flink cluster 
would be able to handle the parallelism of the submitted job.

This is not always true. For streaming workloads in reactive mode, it is 
expected that not all requested resources can be obtained, and as long as the 
minimum resource requirements are fulfilled the job can be executed. Also for 
batch workloads, ideally a job can be executed with a single slot, because 
tasks don't have to be executed at the same time.

Moreover, there's a timeout at the JobMaster side that will fail the job if 
resources cannot be fulfilled within a certain time, with the execution mode 
and minimum resource requirements taken into consideration.

In most cases, when a resource cannot be obtained, Flink still succeeds in 
creating the pod's metadata at the K8s API Server and then keeps waiting for 
the K8s cluster to schedule and bring up the pod. In this case, however, the 
pod creation request itself throws an exception, which the current 
implementation does not handle.

I think we could identify this specific error and not treat it as a fatal 
error. Instead, we can pass the information to the JobMaster via 
{{JobMasterGateway#notifyNotEnoughResourcesAvailable}} and rely on the 
JobMaster to decide whether to fail the job.
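
To make the idea concrete, here is a rough, self-contained sketch of how the 
error could be classified. It is only an illustration, not the actual Flink 
code: {{StubKubernetesClientException}}, {{Action}} and {{classify}} are 
hypothetical names, the stub stands in for the fabric8 exception so the 
example runs without dependencies, and matching on HTTP 403 plus the 
"exceeded quota" text from the log above is an assumption about how the 
error could be recognized.

```java
import java.util.concurrent.CompletionException;

public class QuotaErrorClassifier {

    /** Hypothetical stand-in for fabric8's KubernetesClientException (code + message only). */
    static class StubKubernetesClientException extends RuntimeException {
        private final int code;

        StubKubernetesClientException(String message, int code) {
            super(message);
            this.code = code;
        }

        int getCode() {
            return code;
        }
    }

    /** What the resource manager side would do with a failed pod creation. */
    enum Action { NOTIFY_JOB_MASTER, FATAL }

    /** Strips CompletionException wrappers added by CompletableFuture. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /**
     * Assumption: a quota rejection shows up as HTTP 403 with "exceeded quota"
     * in the message, as in the log attached to this ticket.
     */
    static Action classify(Throwable failure) {
        Throwable root = unwrap(failure);
        if (root instanceof StubKubernetesClientException) {
            StubKubernetesClientException e = (StubKubernetesClientException) root;
            String msg = e.getMessage();
            if (e.getCode() == 403 && msg != null && msg.contains("exceeded quota")) {
                // Not fatal: report the shortage and let the JobMaster decide.
                return Action.NOTIFY_JOB_MASTER;
            }
        }
        // Anything else keeps the existing behavior.
        return Action.FATAL;
    }

    public static void main(String[] args) {
        Throwable quota = new CompletionException(
                new StubKubernetesClientException(
                        "pods \"tm-1-2\" is forbidden: exceeded quota: my-namespace-resource-quota",
                        403));
        Throwable other = new CompletionException(new IllegalStateException("boom"));

        System.out.println(classify(quota));
        System.out.println(classify(other));
    }
}
```

In the real code path, the {{NOTIFY_JOB_MASTER}} branch is where the call to 
{{JobMasterGateway#notifyNotEnoughResourcesAvailable}} would go. Matching on 
the message text is brittle, so a more structured check would be preferable 
if the client exposes one.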

WDYT?

> JobManager crashes after KubernetesClientException exception with 
> FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-31974
>                 URL: https://issues.apache.org/jira/browse/FLINK-31974
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.0
>            Reporter: Sergio Sainz
>            Priority: Major
>
> When resource quota limit is reached JobManager will throw
>  
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: POST at: 
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>  
> In {*}1.16.1, this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN  
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
> Failed requesting worker with resource spec WorkerResourceSpec 
> \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 
> bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb 
> (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException: 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>         at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture.completeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
> ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
> ~[?:?]
>         at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. 
> Message: Forbidden!Configured service account doesn't have access. Service 
> account may have been revoked. pods 
> "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: 
> my-namespace-resource-quota, requested: limits.cpu=3, used: 
> limits.cpu=12100m, limited: limits.cpu=13.
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
>  ~[flink-dist-1.16.1.jar:1.16.1]
>         ... 4 more
> {code}
> But {*}in Flink 1.17.0, the JobManager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler 
>              [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' 
> produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException: 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: POST at: 
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>         at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture.completeThrowable(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown 
> Source) ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
> ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
> ~[?:?]
>         at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: POST at: 
> https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: 
> Forbidden!Configured service account doesn't have access. Service account may 
> have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is 
> forbidden: exceeded quota: my-namespace-resource-quota, requested: 
> limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163)
>  ~[flink-dist-1.17.0.jar:1.17.0]
>         ... 4 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
