[
https://issues.apache.org/jira/browse/FLINK-31457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17700961#comment-17700961
]
Weihua Hu commented on FLINK-31457:
-----------------------------------
[~a.pilipenko] IIUC, this issue is caused by "slot.request.timeout" not taking
effect in a standalone cluster.
The relevant context is that standalone clusters cannot request task managers
dynamically, so in most cases waiting for slot.request.timeout is meaningless.
For example: we have a standalone cluster with 10 slots on 2 task managers. When
we submit a job with a parallelism of 15, the job will never start in this
cluster. In this scenario, failing fast with NoResourceAvailableException is a
better way to inform the user.
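
The fail-fast reasoning in this example can be sketched roughly as follows. This
is only an illustration, not Flink code: StandaloneCluster, check_slots and
NoResourceAvailable are hypothetical names, the 5 slots per task manager is an
assumed even split of the 10 slots, and it assumes a job with parallelism 15
needs 15 slots.

```python
from dataclasses import dataclass


class NoResourceAvailable(Exception):
    """Illustrative stand-in for Flink's NoResourceAvailableException."""


@dataclass(frozen=True)
class StandaloneCluster:
    """Hypothetical model of a fixed-size standalone cluster."""

    task_managers: int
    slots_per_task_manager: int  # assumption: slots are split evenly

    @property
    def total_slots(self) -> int:
        return self.task_managers * self.slots_per_task_manager


def check_slots(cluster: StandaloneCluster, required_slots: int) -> None:
    """Fail immediately if the fixed-size cluster cannot satisfy the request.

    A standalone cluster cannot start task managers on demand, so waiting
    would not change the outcome.
    """
    if required_slots > cluster.total_slots:
        raise NoResourceAvailable(
            f"need {required_slots} slots, cluster has only {cluster.total_slots}"
        )


cluster = StandaloneCluster(task_managers=2, slots_per_task_manager=5)
check_slots(cluster, 10)  # fits: 10 slots requested, 10 available

try:
    check_slots(cluster, 15)  # assumed: parallelism 15 needs 15 slots
except NoResourceAvailable as exc:
    print(f"rejected: {exc}")
```

The check raises right away instead of waiting for a timeout, which mirrors the
behaviour described above for standalone clusters.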
In your case, when some task managers crash, the Flink cluster has no way of
knowing whether other task managers will be started. So the job immediately
fails with NoResourceAvailableException.
You can reserve spare task managers to guard against accidental crashes, or just
increase the number of restart attempts or the restart delay to give the new
task manager a chance to register before the job fails.
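
As a rough way to size those two settings, the sketch below estimates how long
a job keeps retrying under a fixed-delay restart strategy. It is a simplified
model, not Flink's actual scheduling logic: it assumes every restart attempt
fails immediately, so the retry window is about attempts times delay. The
function names are hypothetical; the two inputs correspond to the
restart-strategy.fixed-delay.attempts and restart-strategy.fixed-delay.delay
options.

```python
from datetime import timedelta


def restart_window(attempts: int, delay: timedelta) -> timedelta:
    """Approximate time the job keeps retrying before it fails for good.

    Assumes each attempt fails at once (no slots), so only the delays count.
    """
    return attempts * delay


def can_outlast(attempts: int, delay: timedelta, tm_recovery: timedelta) -> bool:
    """True if a replacement task manager that needs `tm_recovery` to
    register would be back before the restart attempts are exhausted."""
    return tm_recovery <= restart_window(attempts, delay)


# Hypothetical numbers: a replacement task manager registers after 45 seconds.
recovery = timedelta(seconds=45)

print(can_outlast(3, timedelta(seconds=10), recovery))   # 30 s window
print(can_outlast(10, timedelta(seconds=10), recovery))  # 100 s window
print(can_outlast(3, timedelta(seconds=20), recovery))   # 60 s window
```

Either raising the attempt count or lengthening the delay widens the window;
which one is preferable depends on how quickly the job should give up when the
task manager never returns.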
> Support waiting for required resources in DefaultScheduler during job restart
> -----------------------------------------------------------------------------
>
> Key: FLINK-31457
> URL: https://issues.apache.org/jira/browse/FLINK-31457
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.15.3
> Reporter: Aleksandr Pilipenko
> Priority: Major
>
> Currently Flink supports [waiting for required resources to become
> available|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout]
> during job restart only when using the adaptive scheduler.
> On the other hand, if the cluster is using the default scheduler and there are
> not enough slots available, restart attempts will fail with
> `NoResourceAvailableException` until resource requirements are satisfied.