[
https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010460#comment-17010460
]
Xintong Song commented on FLINK-13554:
--------------------------------------
IMO, a clean solution would be for the RM to monitor a timeout for starting
new TMs. But this approach involves introducing a config option for the
timeout, monitoring the timeout asynchronously, and properly cancelling the
monitoring on TM registration, which may not be suitable to add after the
feature freeze.
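To make the three pieces concrete, here is a minimal sketch of such a monitor.
It is only an illustration, not Flink's actual ResourceManager code: the class
PendingWorkerTimeouts, its methods, and the use of a plain
ScheduledExecutorService are all assumptions, and worker IDs are assumed to be
unique. A real implementation would presumably read the timeout from a new
config option and run the callback in the RM's main thread.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper (not a Flink class): tracks TMs that were started but have not registered yet. */
final class PendingWorkerTimeouts implements AutoCloseable {
    private final Duration timeout;            // assumed to come from a new config option
    private final Consumer<String> onTimeout;  // e.g. fail the stuck TM and request a new container
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final Map<String, ScheduledFuture<?>> pending = new HashMap<>();

    PendingWorkerTimeouts(Duration timeout, Consumer<String> onTimeout) {
        this.timeout = timeout;
        this.onTimeout = onTimeout;
    }

    /** Call when the RM starts a new TM: begins monitoring the startup timeout asynchronously. */
    synchronized void workerStarted(String workerId) {
        ScheduledFuture<?> future = timer.schedule(
                () -> expire(workerId), timeout.toMillis(), TimeUnit.MILLISECONDS);
        pending.put(workerId, future);
    }

    /** Call when the TM registers: stops monitoring. Returns false if the worker was not pending. */
    synchronized boolean workerRegistered(String workerId) {
        ScheduledFuture<?> future = pending.remove(workerId);
        if (future == null) {
            return false; // unknown worker, or it already timed out
        }
        future.cancel(false);
        return true;
    }

    /** Runs on the timer thread once the timeout has elapsed. */
    private void expire(String workerId) {
        synchronized (this) {
            if (pending.remove(workerId) == null) {
                return; // the worker registered in the meantime
            }
        }
        onTimeout.accept(workerId); // invoked outside the lock
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

With something like this, the RM would call workerStarted(id) right after
requesting the container and workerRegistered(id) from the TM registration
path. The onTimeout callback is where the stuck TM would be failed and a
replacement requested.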
Also, this does not seem to be a common case. We have not seen any reports of
this bug from users. We ran into this problem (both this ticket and
FLINK-15456) only when testing the stability of Flink with ChaosMonkey
intentionally breaking the network connections.
Therefore, I'm in favor of not fixing this problem in release 1.10.0.
> ResourceManager should have a timeout on starting new TaskExecutors.
> --------------------------------------------------------------------
>
> Key: FLINK-13554
> URL: https://issues.apache.org/jira/browse/FLINK-13554
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.9.0
> Reporter: Xintong Song
> Priority: Critical
> Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck while
> launching on Yarn (without failing), so that the job could not recover from
> continuous failovers.
> The TaskExecutor gets stuck because of a problem in our environment. It gets
> stuck somewhere after the ResourceManager starts it, while the
> ResourceManager is waiting for it to be brought up and register. Later, when
> the slot request times out, the job fails over and requests slots from the
> ResourceManager again, but the ResourceManager still sees a TaskExecutor (the
> stuck one) being started and will not request a new container from Yarn.
> Therefore, the job cannot recover from the failure.
> I think that, to avoid such an unrecoverable state, the ResourceManager
> needs to have a timeout on starting new TaskExecutors. If starting a
> TaskExecutor takes too long, it should just fail that TaskExecutor and start
> a new one.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)