[ https://issues.apache.org/jira/browse/FLINK-29308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611349#comment-17611349 ]

Zhu Zhu commented on FLINK-29308:
---------------------------------

It's possible that the released resources cannot fulfill the new slot request. Flink therefore uses this mechanism to fail fast; otherwise users may find a job pending for quite some time with no progress. This is helpful when the resource cluster is problematic or the job is misconfigured (wrong resource spec, wrong resource queue, etc.).

One potential problem with ignoring the NoResourceAvailableException is that a job may wait indefinitely until it can obtain a required slot. You can do it if that is acceptable in your case.

> NoResourceAvailableException fails the batch job
> ------------------------------------------------
>
>                 Key: FLINK-29308
>                 URL: https://issues.apache.org/jira/browse/FLINK-29308
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Aitozi
>            Priority: Major
>
> When running a batch job configured with the following restart strategy
> {code:java}
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.delay: 15 s
> restart-strategy.fixed-delay.attempts: 10 {code}
> If the cluster does not have enough resources to run a single stage in full, it can run
> part of the stage, but the job will still fail after 10
> {{NoResourceAvailableException}} failures. IMO, for a batch job a
> {{NoResourceAvailableException}} does not necessarily need to fail the job.
> Or at least this failure reason should not share the same restart
> strategy with other failure reasons.
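
To make the trade-off in this thread concrete, here is a minimal Java sketch of a fixed-delay restart budget that can optionally exempt resource failures. It is not Flink's implementation or API: `RestartBudgetSketch`, `FixedDelayBudget`, `canRestartAfter`, `failuresUntilJobFails`, and the local `NoResourceAvailableException` stand-in are all hypothetical names, and the counting rule (the job fails on the first failure after the configured attempts are used up) is an assumption.

```java
import java.time.Duration;

/** Hypothetical sketch of the idea discussed above; none of this is Flink API. */
public class RestartBudgetSketch {

    /** Local stand-in for Flink's exception of the same name (assumption). */
    static class NoResourceAvailableException extends RuntimeException {
        NoResourceAvailableException(String message) {
            super(message);
        }
    }

    /** Fixed-delay restart budget that can exempt resource failures. */
    static class FixedDelayBudget {
        private final int maxAttempts;
        private final Duration delay;
        private final boolean exemptResourceFailures;
        private int usedAttempts;

        FixedDelayBudget(int maxAttempts, Duration delay, boolean exemptResourceFailures) {
            this.maxAttempts = maxAttempts;
            this.delay = delay;
            this.exemptResourceFailures = exemptResourceFailures;
        }

        /** Returns true if the job may restart after this failure. */
        boolean canRestartAfter(Throwable failure) {
            if (exemptResourceFailures && failure instanceof NoResourceAvailableException) {
                // Does not consume an attempt, so the job may wait indefinitely.
                return true;
            }
            usedAttempts++;
            return usedAttempts <= maxAttempts;
        }

        /** Total time spent in restart delays before the budget is exhausted. */
        Duration totalDelayBeforeFailure() {
            return delay.multipliedBy(maxAttempts);
        }
    }

    /** Counts failures until the budget gives up, or returns -1 if it never does within cap. */
    static int failuresUntilJobFails(FixedDelayBudget budget, Throwable failure, int cap) {
        for (int n = 1; n <= cap; n++) {
            if (!budget.canRestartAfter(failure)) {
                return n;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Throwable noSlots = new NoResourceAvailableException("no slot available");
        FixedDelayBudget shared = new FixedDelayBudget(10, Duration.ofSeconds(15), false);
        FixedDelayBudget exempt = new FixedDelayBudget(10, Duration.ofSeconds(15), true);

        System.out.println("shared budget gives up at failure #"
                + failuresUntilJobFails(shared, noSlots, 1000));
        System.out.println("exempt budget result within cap: "
                + failuresUntilJobFails(exempt, noSlots, 1000));
        System.out.println("delay spent before giving up: "
                + shared.totalDelayBeforeFailure().getSeconds() + " s");
    }
}
```

With the shared budget, repeated resource failures exhaust the 10 attempts and the job fails fast. With the exemption, the loop never gives up within the cap, which is the "wait indefinitely" risk described in the comment.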