[
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475999#comment-17475999
]
Till Rohrmann edited comment on FLINK-25649 at 1/14/22, 7:30 AM:
-----------------------------------------------------------------
cc [~dmvk], [~zhuzh]
was (Author: till.rohrmann):
cc [~dmvk]
> Scheduling jobs fails with
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -----------------------------------------------------------------------------------------------------
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.13.1
> Reporter: Gil De Grove
> Priority: Major
>
> Following comment from Till on this [SO
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our Flink cluster.
> The symptom is that some, most, or all of our tasks (it varies from one
> occurrence to the next) are shown as _SCHEDULED_ but fail after a timeout.
> The jobs are then shown as _RUNNING_.
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not many
> logs for that part of the code) that the failure is due to a deadlock/race
> condition that occurs when several jobs are submitted at the same
> time to the Flink cluster, even though we have enough slots available in the
> cluster.
> We currently see the error with 52 available task slots and 12 jobs
> that are not scheduled.
> h2. Additional information
> * Flink version: 1.13.1 commit a7f3192
> * Flink cluster in session mode
> * 2 job managers using k8s HA mode (resource requests: 2 CPUs, 4 GB RAM;
> memory limit set to 4 GB)
> * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM;
> no limits set)
> * Our Flink cluster is shut down every night, and restarted every morning.
> The error seems to occur when a lot of jobs need to be scheduled. The jobs
> are configured to restore their state, and we do not see any issues for jobs
> that are scheduled and run correctly, so it really seems to be a
> scheduling issue.