[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478436#comment-17478436
 ] 

Gil De Grove commented on FLINK-25649:
--------------------------------------

[~zhuzh] [~trohrmann] ,

I was able to reproduce the issue this morning, I'm gathering the logs and will 
clean them then upload it there, the logs will contain the logs for both the 
job-manager of our HA deployment. 

This morning, I actually only have DataStream (full stream) jobs that run, 3 of 
them cannot be scheduled, they all have a parallelism of 3. 

The job-manager reports 30 task slots available. 

>> One question I have is that when you say "100% of the jobs working", do you 
>> mean "All tasks of all jobs are in RUNNING state at the same time", or "All 
>> bounded jobs can finish and all unbounded jobs' tasks are in RUNING state"?

I mean; all the tasks of all the jobs, bounded or unbounded, are scheduled at 
the same time on the cluster.

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25649
>                 URL: https://issues.apache.org/jira/browse/FLINK-25649
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.1
>            Reporter: Gil De Grove
>            Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to