[ https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477072#comment-17477072 ]

Zhu Zhu commented on FLINK-25649:
---------------------------------

From your description, it's possible that your cluster does not have enough 
resources to run all jobs *at the same time*.
There is a known issue where a session cluster can deadlock on resources when 
it runs multiple bounded streaming jobs (or batch jobs with PIPELINED edges).
For example, suppose a cluster has 30 slots and we want to run 2 bounded 
streaming jobs on it, each needing 20 slots.
- If the 2 jobs are submitted at the same time, there is a *chance* that each 
job acquires 15 slots and neither can run. The slots will be kept by the 
SlotPool of each job and cannot be returned to the ResourceManager, because 
each job will request them again before the idle timeout is reached.
- If the 2 jobs are submitted with an interval, the earlier job may acquire 20 
slots and start running. The later-submitted one can acquire only 10 slots and 
has to wait; once the earlier job finishes, its slots are returned and the 
later job can start running.

If this is your case, you can try setting the config option 
`slot.idle.timeout` to `0`, and perhaps increasing 
`restart-strategy.fixed-delay.delay` a bit. That way, when a job fails to 
acquire enough slots within the timeout, it quickly returns all of its slots 
to the ResourceManager, and other jobs are able to get the slots they need to 
run. This may mitigate your problem, although slot request timeouts may still 
happen at times.
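
For reference, a minimal sketch of what this could look like in 
`flink-conf.yaml` (the `10 s` delay is only an illustrative value, not a 
recommendation; tune it to your workload):

{code}
# Return idle slots to the ResourceManager immediately instead of
# caching them in the job's SlotPool.
slot.idle.timeout: 0

# Illustrative: back off a bit between restart attempts so that other
# jobs have time to pick up the freed slots in the meantime.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.delay: 10 s
{code}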

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25649
>                 URL: https://issues.apache.org/jira/browse/FLINK-25649
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.1
>            Reporter: Gil De Grove
>            Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our Flink cluster.
> The symptoms are that some/most/all (it depends, the symptoms are not always 
> the same) of our tasks are shown as _SCHEDULED_ but fail after a timeout. 
> The jobs are then shown as _RUNNING_.
> The failing exception is the following:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not many logs 
> for that part of the code) that the failure is due to a deadlock/race 
> condition that happens when several jobs are submitted at the same time to 
> the Flink cluster, even though we have enough slots available in the 
> cluster.
> We actually see the error with 52 available task slots and 12 jobs that are 
> not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night and restarted every morning. 
> The error seems to occur when a lot of jobs need to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are scheduled and run correctly; it really seems to be related to a 
> scheduling issue.



