[
https://issues.apache.org/jira/browse/FLINK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yangze Guo resolved FLINK-16299.
--------------------------------
Resolution: Not A Problem
> Release containers recovered from previous attempt in which TaskExecutor is
> not started.
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-16299
> URL: https://issues.apache.org/jira/browse/FLINK-16299
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Reporter: Xintong Song
> Assignee: Yangze Guo
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> As discussed in FLINK-16215, on Yarn deployment, {{YarnResourceManager}}
> starts a new {{TaskExecutor}} in two steps:
> # Request a new container from Yarn
> # Starts a {{TaskExecutor}} process in the allocated container
> If JM failover happens between the two steps, in the new attempt
> {{YarnResourceManager}} will not start {{TaskExecutor}} processes in
> recovered containers. That means such containers are neither used nor
> released.
> A potential fix to this problem is to query for the container status by
> calling {{NMClientAsync#getContainerStatusAsync}}, and release the containers
> whose state is {{NEW}}, keeps only those whose state is {{RUNNING}} and
> waiting for them to register.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)