[ https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xintong Song updated FLINK-13554:
---------------------------------
    Fix Version/s: 1.10.0

> ResourceManager should have a timeout on starting new TaskExecutors.
> --------------------------------------------------------------------
>
> Key: FLINK-13554
> URL: https://issues.apache.org/jira/browse/FLINK-13554
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.9.0
> Reporter: Xintong Song
> Priority: Major
> Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck while
> launching on Yarn (without failing), so the job could not recover from
> continuous failovers.
> The TaskExecutor got stuck because of a problem in our environment. It hung
> somewhere after the ResourceManager had started it, while the
> ResourceManager was waiting for it to come up and register.
> Later, when the slot request timed out, the job failed over and requested
> slots from the ResourceManager again. The ResourceManager still saw a
> TaskExecutor (the stuck one) as being started and did not request a new
> container from Yarn.
> Therefore, the job could not recover from the failure.
> To avoid such an unrecoverable state, I think the ResourceManager needs a
> timeout on starting new TaskExecutors. If starting a TaskExecutor takes too
> long, the ResourceManager should fail that TaskExecutor and start a new one.
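The proposed start timeout could be sketched roughly as follows. This is a minimal illustration only, not Flink's ResourceManager code: the class `PendingWorkerTracker`, its method names, and the millisecond timeout parameter are all hypothetical. Time is passed in explicitly so the logic can be exercised without real timers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not a Flink class): tracks workers that have been
 * requested but have not registered yet, and reports those whose start
 * has taken longer than the configured timeout.
 */
final class PendingWorkerTracker {
    private final long startTimeoutMillis;
    // workerId -> time at which the start was requested
    private final Map<String, Long> startedAtMillis = new HashMap<>();

    PendingWorkerTracker(long startTimeoutMillis) {
        this.startTimeoutMillis = startTimeoutMillis;
    }

    /** Record that a new worker (e.g. a container) has been requested. */
    void onWorkerRequested(String workerId, long nowMillis) {
        startedAtMillis.put(workerId, nowMillis);
    }

    /** The worker came up and registered, so it is no longer pending. */
    void onWorkerRegistered(String workerId) {
        startedAtMillis.remove(workerId);
    }

    /**
     * Remove and return all pending workers whose start has timed out.
     * The caller would fail these workers and request replacements.
     */
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = startedAtMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= startTimeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }

    /** Number of workers still counted as "being started". */
    int pendingCount() {
        return startedAtMillis.size();
    }
}
```

Under these assumptions, a periodic check would call `expire(...)`, release each returned worker, and request a replacement. A stuck TaskExecutor would then stop counting as "being started", and the next slot request could trigger a new container request.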