[
https://issues.apache.org/jira/browse/FLINK-22135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320868#comment-17320868
]
Xintong Song commented on FLINK-22135:
--------------------------------------
I've tested this feature on a standalone session cluster and a native
Kubernetes session cluster.
I have only one comment, in addition to those already documented as known
limitations or reported in FLINK-22134.
The default 'jobmanager.adaptive-scheduler.resource-wait-timeout: 10s' is a bit
too short for active resource managers. There are initially no workers, and 10s
is in most cases not enough for newly requested workers to start and register,
even if there are sufficient resources in the cluster. Consequently, the job
fails before the TMs register. (logs: [^jobmanager_log.txt])
A workaround is to configure a larger resource wait timeout. However:
- The experience is not great for someone switching from the default scheduler
to the adaptive scheduler in search of better tolerance against insufficient
resources or TM loss, since he/she has to specify yet another configuration
option.
- Increasing the resource wait timeout also potentially increases the downtime
during failover when resources become insufficient.
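For reference, the workaround would look roughly like the following in
flink-conf.yaml. This is only a sketch: the 60s value is an assumed placeholder
and not a tested recommendation, and a suitable value depends on how long the
cluster takes to start and register new TMs.

```yaml
# Enable the adaptive scheduler (instead of the default scheduler).
jobmanager.scheduler: adaptive
# Wait longer than the 10s default for requested workers to register.
# 60s is an illustrative assumption; tune it to the worker startup time.
jobmanager.adaptive-scheduler.resource-wait-timeout: 60s
```

The trade-off from the second point above still applies: the larger this value,
the longer the job may wait during failover when resources are insufficient.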
> Test the adaptive scheduler
> ---------------------------
>
> Key: FLINK-22135
> URL: https://issues.apache.org/jira/browse/FLINK-22135
> Project: Flink
> Issue Type: Test
> Components: Runtime / Coordination
> Affects Versions: 1.13.0
> Reporter: Till Rohrmann
> Assignee: Xintong Song
> Priority: Blocker
> Labels: release-testing
> Fix For: 1.13.0
>
> Attachments: jobmanager_log.txt
>
>
> With FLINK-21075, we introduced a new scheduler type which first waits for
> resources before deciding on the actual parallelism. This allows a job to
> continue executing even if the cluster loses a {{TaskManager}} permanently. We
> should test that this feature works as described by its documentation [1]
> (w/o using the reactive mode).
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/elastic_scaling/#adaptive-scheduler