[ https://issues.apache.org/jira/browse/FLINK-22135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320868#comment-17320868 ]
Xintong Song commented on FLINK-22135:
--------------------------------------

I've tested this feature with a standalone session cluster and a native Kubernetes session cluster. I have only one comment, in addition to those already documented as known limitations or reported in FLINK-22134.

The default 'jobmanager.adaptive-scheduler.resource-wait-timeout: 10s' is a bit too short for active resource managers. There are initially no workers, and 10s is in most cases not enough for newly requested workers to start and register, even if there are sufficient resources in the cluster. Consequently, the job fails before the TMs register. (logs: [^jobmanager_log.txt])

A workaround is to configure a larger resource wait timeout. However:
- The experience is not that good for someone switching from the default scheduler to the adaptive scheduler in search of better tolerance against insufficient resources or lost TMs, since he/she has to specify another configuration option.
- Increasing the resource wait timeout also potentially increases the downtime during failover when resources become insufficient.

> Test the adaptive scheduler
> ---------------------------
>
>                 Key: FLINK-22135
>                 URL: https://issues.apache.org/jira/browse/FLINK-22135
>             Project: Flink
>          Issue Type: Test
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Till Rohrmann
>            Assignee: Xintong Song
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.13.0
>
>         Attachments: jobmanager_log.txt
>
>
> With FLINK-21075, we introduced a new scheduler type which first waits for
> resources before deciding on the actual parallelism. This allows a job to
> continue executing even if the cluster loses a {{TaskManager}} permanently. We
> should test that this feature works as described by its documentation [1]
> (w/o using the reactive mode).
> [1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/elastic_scaling/#adaptive-scheduler

--
This message was sent by Atlassian Jira (v8.3.4#803005)
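
For reference, the workaround described in the comment (a larger resource wait timeout) can be sketched as a flink-conf.yaml fragment. The 300s value is only an illustrative assumption and not a recommendation from the issue; it would need to be long enough for newly requested workers to start and register in the given environment. 'jobmanager.scheduler: adaptive' is the option that selects the adaptive scheduler according to the documentation [1].

```yaml
# Select the adaptive scheduler instead of the default one.
jobmanager.scheduler: adaptive

# Wait longer than the 10s default for TaskManagers to register.
# 300s is an assumed example value; tune it to the worker startup time.
jobmanager.adaptive-scheduler.resource-wait-timeout: 300s
```

As noted in the comment, a larger value trades faster initial scheduling failures for potentially longer downtime during failover when resources become insufficient.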