[ https://issues.apache.org/jira/browse/FLINK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046248#comment-17046248 ]
Yangze Guo edited comment on FLINK-16299 at 2/27/20 7:17 AM: ------------------------------------------------------------- I'd like to work on it. was (Author: karmagyz): I'd like work on it. > Release containers recovered from previous attempt in which TaskExecutor is > not started. > ---------------------------------------------------------------------------------------- > > Key: FLINK-16299 > URL: https://issues.apache.org/jira/browse/FLINK-16299 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN > Reporter: Xintong Song > Priority: Major > > As discussed in FLINK-16215, on Yarn deployment, {{YarnResourceManager}} > starts a new {{TaskExecutor}} in two steps: > # Request a new container from Yarn > # Starts a {{TaskExecutor}} process in the allocated container > If JM failover happens between the two steps, in the new attempt > {{YarnResourceManager}} will not start {{TaskExecutor}} processes in > recovered containers. That means such containers are neither used nor > released. > A potential fix to this problem, is to query form the container status by > calling {{NMClientAsync#getContainerStatusAsync}}, and release the containers > whose state is {{NEW}}, keeps only those whose state is {{RUNNING}} and > waiting for them to register. -- This message was sent by Atlassian Jira (v8.3.4#803005)