[ 
https://issues.apache.org/jira/browse/FLINK-29308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611349#comment-17611349
 ] 

Zhu Zhu commented on FLINK-29308:
---------------------------------

It's possible that the released resources cannot fulfill the new slot request. 
Flink therefore uses this mechanism to fail fast; otherwise users may find a job 
pending for quite some time with no progress. This can be helpful when the 
resource cluster is problematic or the job is misconfigured (wrong 
resource spec, wrong resource queue, etc.).
One potential problem with ignoring the NoResourceAvailableException is that a job 
may wait indefinitely until it can obtain a required slot. You can do so if 
that's acceptable in your case.
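
To make the trade-off concrete, here is a minimal, self-contained sketch. It does not use Flink's real classes or API: NoResourceException, FixedDelayBudget and RestartBudgetSketch are hypothetical names made up for illustration, and the delay between attempts is left out. It only models a fixed attempt budget and what changes when no-resource failures are not counted against it.

```java
/** Hypothetical stand-in for a "no resource available" failure; NOT Flink's class. */
class NoResourceException extends RuntimeException {}

/** Hypothetical attempt budget, loosely modelled on a fixed-delay restart strategy. */
final class FixedDelayBudget {
    private final int maxAttempts;
    private final boolean ignoreNoResource;
    private int used;

    FixedDelayBudget(int maxAttempts, boolean ignoreNoResource) {
        this.maxAttempts = maxAttempts;
        this.ignoreNoResource = ignoreNoResource;
    }

    /** Returns true if the job may restart after the given failure. */
    boolean canRestartAfter(Throwable failure) {
        if (ignoreNoResource && failure instanceof NoResourceException) {
            // The failure never consumes the budget, so the job can wait forever.
            return true;
        }
        used++;
        return used <= maxAttempts;
    }
}

public class RestartBudgetSketch {
    /** Counts how many consecutive no-resource failures the job survives. */
    static int survivedFailures(FixedDelayBudget budget, int failures) {
        int survived = 0;
        for (int i = 0; i < failures; i++) {
            if (!budget.canRestartAfter(new NoResourceException())) {
                break; // budget exhausted: the job fails fast
            }
            survived++;
        }
        return survived;
    }

    public static void main(String[] args) {
        // Counted against the budget: the job fails once the 10 attempts are used up.
        System.out.println(survivedFailures(new FixedDelayBudget(10, false), 100));
        // Ignored: the job keeps retrying for as long as the failures continue.
        System.out.println(survivedFailures(new FixedDelayBudget(10, true), 100));
    }
}
```

In the second case nothing bounds the wait, so a misconfigured resource spec or queue would go unnoticed unless some other timeout is in place.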

> NoResourceAvailableException fails the batch job
> ------------------------------------------------
>
>                 Key: FLINK-29308
>                 URL: https://issues.apache.org/jira/browse/FLINK-29308
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Aitozi
>            Priority: Major
>
> When running a batch job configured with the following restart strategy:
> {code:java}
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.delay: 15 s
> restart-strategy.fixed-delay.attempts: 10
> {code}
> If the cluster resources are not enough to run a single stage, the job can run 
> part of the stage, but it will still fail after 10 occurrences of 
> {{NoResourceAvailableException}}. IMO, for a batch job, 
> {{NoResourceAvailableException}} does not necessarily need to trigger a job failure. 
> Or at least this failure reason is not suitable to share the same restart 
> strategy with other failure reasons.



