[
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478436#comment-17478436
]
Gil De Grove commented on FLINK-25649:
--------------------------------------
[~zhuzh] [~trohrmann] ,
I was able to reproduce the issue this morning. I'm gathering the logs and will
clean them up, then upload them to this ticket. They will cover both
job-managers of our HA deployment.
This morning, only DataStream (full stream) jobs are running. 3 of them cannot
be scheduled; they all have a parallelism of 3.
The job-manager reports 30 task slots available.
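As a rough sanity check of these numbers, here is a minimal sketch. It assumes
default slot sharing, so that each job needs as many slots as its highest
operator parallelism; `slots_needed` is a hypothetical helper and not part of
Flink.

```python
# Hypothetical helper, not a Flink API: compares slot demand with free slots.
def slots_needed(job_parallelisms):
    """Total slots required, assuming one slot per parallel subtask per job
    (default slot sharing, so a job needs max-parallelism slots)."""
    return sum(job_parallelisms)


pending_jobs = [3, 3, 3]  # the three unscheduled jobs, parallelism 3 each
free_slots = 30           # task slots reported as available by the job-manager

required = slots_needed(pending_jobs)
print(f"required={required}, free={free_slots}, enough={required <= free_slots}")
```

Under that assumption the pending jobs need far fewer slots than are reported
free, which is why a plain shortage of slots does not look like the cause.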
>> One question I have is that when you say "100% of the jobs working", do you
>> mean "All tasks of all jobs are in RUNNING state at the same time", or "All
>> bounded jobs can finish and all unbounded jobs' tasks are in RUNNING state"?
I mean that all the tasks of all the jobs, bounded or unbounded, are scheduled
at the same time on the cluster.
> Scheduling jobs fails with
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -----------------------------------------------------------------------------------------------------
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.13.1
> Reporter: Gil De Grove
> Priority: Major
>
> Following comment from Till on this [SO
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depends; the symptoms are not always
> the same) of our tasks are shown as _SCHEDULED_ but fail after a timeout.
> The jobs are then shown as _RUNNING_.
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not many logs
> for that part of the code) that the failure is due to a deadlock/race
> condition that happens when several jobs are submitted at the same time to
> the Flink cluster, even though we have enough slots available in the
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs
> that are not scheduled.
> h2. Additional information
> * Flink version: 1.13.1 commit a7f3192
> * Flink cluster in session mode
> * 2 job managers using k8s HA mode (resource requests: 2 CPUs, 4 GB RAM;
> memory limit set to 4 GB)
> * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM; no
> limits set).
> * Our Flink cluster is shut down every night, and restarted every morning.
> The error seems to occur when a lot of jobs need to be scheduled. The jobs
> are configured to restore their state, and we do not see any issues for jobs
> that are scheduled and run correctly. It really seems to be a scheduling
> issue.