[jira] [Comment Edited] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

Till Rohrmann (Jira) Thu, 13 Jan 2022 23:31:53 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475999#comment-17475999
 ]


Till Rohrmann edited comment on FLINK-25649 at 1/14/22, 7:30 AM:
-----------------------------------------------------------------

cc [~dmvk], [~zhuzh]


was (Author: till.rohrmann):
cc [~dmvk]

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25649
>                 URL: https://issues.apache.org/jira/browse/FLINK-25649
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.1
>            Reporter: Gil De Grove
>            Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Comment Edited] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

Reply via email to