[jira] [Commented] (FLINK-22135) Test the adaptive scheduler

Xintong Song (Jira) Wed, 14 Apr 2021 02:59:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-22135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320868#comment-17320868
 ]


Xintong Song commented on FLINK-22135:
--------------------------------------

I've tested this feature with a standalone session cluster and a native 
Kubernetes session cluster.

I have only one comment, in addition to those already documented as known 
limitations or reported in FLINK-22134.

The default 'jobmanager.adaptive-scheduler.resource-wait-timeout: 10s' is a bit 
too short for active resource managers. There are initially no workers, and 10s 
is in most cases not enough for newly requested workers to start and register, 
even if there are sufficient resources in the cluster. Consequently, the job 
fails before the TMs register. (logs: [^jobmanager_log.txt])

A workaround is to configure a larger resource wait timeout. However:
- The experience is not great for someone who switches from the default 
scheduler to the adaptive scheduler for better tolerance of insufficient 
resources or lost TMs, since they have to set an additional configuration 
option.
- Increasing the resource wait timeout can also increase the downtime during 
failover when resources become insufficient.
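
For reference, a minimal sketch of the workaround in flink-conf.yaml. The 60s 
value is only an illustrative assumption, not a tested recommendation; the 
right value depends on how long the active resource manager needs to bring up 
workers in a given environment.

```yaml
# Enable the adaptive scheduler (not the default scheduler).
jobmanager.scheduler: adaptive

# Wait longer than the 10s default for requested workers to start and
# register. 60s is an assumed example value, not a measured one.
jobmanager.adaptive-scheduler.resource-wait-timeout: 60s
```

The trade-off described in the second point above still applies: a larger 
value here can also lengthen the wait during failover when resources become 
insufficient.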

> Test the adaptive scheduler
> ---------------------------
>
>                 Key: FLINK-22135
>                 URL: https://issues.apache.org/jira/browse/FLINK-22135
>             Project: Flink
>          Issue Type: Test
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Till Rohrmann
>            Assignee: Xintong Song
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.13.0
>
>         Attachments: jobmanager_log.txt
>
>
> With FLINK-21075, we introduced a new scheduler type which first waits for 
> resources before deciding on the actual parallelism. This allows to continue 
> executing a job even if the cluster loses a {{TaskManager}} permanently. We 
> should test that this feature works as described by its documentation [1] 
> (w/o using the reactive mode).
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/elastic_scaling/#adaptive-scheduler



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
