[jira] [Comment Edited] (FLINK-16299) Release containers recovered from previous attempt in which TaskExecutor is not started.

Till Rohrmann (Jira) Thu, 27 Feb 2020 06:57:29 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046699#comment-17046699
 ]


Till Rohrmann edited comment on FLINK-16299 at 2/27/20 2:56 PM:
----------------------------------------------------------------

Thanks for reporting the issue [~xintongsong]. Go ahead [~karmagyz], I've 
assigned you to the issue.

The solution approach sounds good to me.


was (Author: till.rohrmann):
Thanks for reporting the issue [~xintongsong]. Go ahead [~karmagyz], I've 
assigned you to the issue.

> Release containers recovered from previous attempt in which TaskExecutor is 
> not started.
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-16299
>                 URL: https://issues.apache.org/jira/browse/FLINK-16299
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>            Reporter: Xintong Song
>            Assignee: Yangze Guo
>            Priority: Major
>
> As discussed in FLINK-16215, in a Yarn deployment, {{YarnResourceManager}} 
> starts a new {{TaskExecutor}} in two steps:
>  # Request a new container from Yarn
>  # Start a {{TaskExecutor}} process in the allocated container
> If JM failover happens between the two steps, in the new attempt 
> {{YarnResourceManager}} will not start {{TaskExecutor}} processes in 
> recovered containers. That means such containers are neither used nor 
> released.
> A potential fix for this problem is to query the container status by 
> calling {{NMClientAsync#getContainerStatusAsync}}, release the containers 
> whose state is {{NEW}}, and keep only those whose state is {{RUNNING}}, 
> waiting for them to register.
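
The release/keep rule proposed above can be sketched without any Yarn dependency. This is only an illustration, not the actual Flink change: the `State` enum is a local stand-in for Yarn's container states, and `RecoveredContainerTriage`, `Decision`, and `triage` are hypothetical names. In a real fix the states would presumably arrive asynchronously after calling {{NMClientAsync#getContainerStatusAsync}} for each recovered container, and "release" would go through the resource manager client. The issue only mentions {{NEW}} and {{RUNNING}}, so any other state is left untouched here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecoveredContainerTriage {

    /** Local stand-in for the Yarn container states (assumption, not the Yarn type). */
    enum State { NEW, RUNNING, COMPLETE }

    /** Hypothetical result holder: which containers to release, which to wait for. */
    static final class Decision {
        final List<String> release = new ArrayList<>();
        final List<String> awaitRegistration = new ArrayList<>();
    }

    /**
     * Applies the rule from the issue description to containers recovered
     * from the previous attempt, keyed by a container id string.
     */
    static Decision triage(Map<String, State> recovered) {
        Decision decision = new Decision();
        for (Map.Entry<String, State> entry : recovered.entrySet()) {
            switch (entry.getValue()) {
                case NEW:
                    // No TaskExecutor process was ever started: release it.
                    decision.release.add(entry.getKey());
                    break;
                case RUNNING:
                    // A TaskExecutor process exists: wait for it to register.
                    decision.awaitRegistration.add(entry.getKey());
                    break;
                default:
                    // Other states are not covered by the issue text; ignored here.
                    break;
            }
        }
        return decision;
    }

    public static void main(String[] args) {
        Map<String, State> recovered = new LinkedHashMap<>();
        recovered.put("container_01", State.NEW);
        recovered.put("container_02", State.RUNNING);
        Decision decision = triage(recovered);
        System.out.println("release=" + decision.release
                + " await=" + decision.awaitRegistration);
    }
}
```

Keeping the decision in a pure function like this would make the rule easy to unit test independently of the asynchronous status callbacks.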



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
