[jira] [Created] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

Gil De Grove (Jira) Thu, 13 Jan 2022 05:59:11 -0800

Gil De Grove created FLINK-25649:
------------------------------------

             Summary: Scheduling jobs fails with 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
                 Key: FLINK-25649
                 URL: https://issues.apache.org/jira/browse/FLINK-25649
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.13.1
            Reporter: Gil De Grove

Following a comment from Till on this [SO
question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
h2. *Summary*

We are currently experiencing a scheduling issue with our Flink cluster.

The symptoms are that some, most, or all of our tasks (it varies, the symptoms
are not always the same) are shown as _SCHEDULED_ but fail after a timeout. The
jobs are then shown as _RUNNING_.

The failing exception is the following one:

{{Caused by: java.util.concurrent.CompletionException:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Slot request bulk is not fulfillable! Could not allocate the required slot
within slot request timeout}}

After analysis, we assume (we cannot prove it, as there are not many logs for
that part of the code) that the failure is due to a deadlock or race condition
that occurs when several jobs are submitted to the Flink cluster at the same
time, even though enough slots are available in the cluster.

We see the error while 52 task slots are available and 12 jobs remain
unscheduled.
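
For illustration only, the slot arithmetic behind this claim can be sketched as below. This is a minimal sketch and not part of our tooling: the helper name {{slots_shortfall}} is hypothetical, the payload imitates the {{slots-total}} and {{slots-available}} fields of Flink's REST {{/overview}} response using the figures from this report, and the demand of 2 slots per job is purely an assumption, since the real demand depends on each job's parallelism.

```python
from typing import Mapping


def slots_shortfall(overview: Mapping[str, int], pending_slot_demand: int) -> int:
    """Return how many slots are missing for the pending demand (0 if it fits)."""
    available = int(overview["slots-available"])
    return max(0, pending_slot_demand - available)


# Sample payload built from the numbers in this report:
# 50 task managers x 2 slots = 100 slots in total, 52 of them free.
sample_overview = {"taskmanagers": 50, "slots-total": 100, "slots-available": 52}

# ASSUMPTION: each of the 12 unscheduled jobs needs 2 slots.
# The real demand depends on the parallelism of each job.
assumed_slots_per_job = 2
pending_demand = 12 * assumed_slots_per_job

shortfall = slots_shortfall(sample_overview, pending_demand)
print(f"pending demand: {pending_demand}, shortfall: {shortfall}")
```

Under that assumption the shortfall is zero, which is why we suspect the scheduling path rather than a real lack of resources.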
h2. Additional information
* Flink version: 1.13.1 commit a7f3192
* Flink cluster in session mode
* 2 job managers using k8s HA mode (resource requests: 2 CPUs, 4 GB RAM; memory
limit set to 4 GB)
* 50 task managers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM; no
limits set)
* Our Flink cluster is shut down every night and restarted every morning. The
error seems to occur when many jobs need to be scheduled. The jobs are
configured to restore their state, and we see no issues with the jobs that do
get scheduled and run correctly, so the problem really seems to be related to
scheduling.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
