[jira] [Commented] (FLINK-9190) YarnResourceManager sometimes does not request new Containers

ASF GitHub Bot (JIRA) Thu, 03 May 2018 23:11:13 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463404#comment-16463404
 ]


ASF GitHub Bot commented on FLINK-9190:
---------------------------------------

Github user sihuazhou commented on the issue:

    https://github.com/apache/flink/pull/5931
  
    Hi @shuai-xu, if I understand correctly, your approach is exactly 
what I did in the previous 
[PR](https://github.com/apache/flink/pull/5881) for this ticket, but it faces 
the same problem as this PR: even if the container has registered with the RM 
successfully, it can still be killed after the RM offers the slot to the JM 
but before the container registers with the JM. I think one way to overcome 
this is for the RM to notify the JM which TM it will connect with before the 
RM assigns the slot to it. That way, the JM can be told when the TM is killed 
before it connects successfully.
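
    The ordering proposed above can be sketched roughly as follows. This is 
only an illustration of the idea, not Flink code: `SketchJobManager`, 
`SketchResourceManager` and all of their methods are hypothetical names, and 
TMs are identified by plain strings. The RM announces the TM to the JM before 
assigning the slot, so a container that dies before it registers with the JM 
can still be reported to the JM.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the JM side; this is NOT Flink's JobMaster API.
class SketchJobManager {
    // TMs the RM has announced but that have not registered with the JM yet.
    private final Set<String> announced = new HashSet<>();
    // TMs that completed registration with the JM.
    private final Set<String> connected = new HashSet<>();
    // TMs that were killed after being announced but before registering.
    private final List<String> lostBeforeConnecting = new ArrayList<>();

    // Called by the RM before it assigns the slot.
    void notifyPendingTaskManager(String tmId) {
        announced.add(tmId);
    }

    // Called when the TM registers with the JM.
    void onTaskManagerRegistered(String tmId) {
        if (announced.remove(tmId)) {
            connected.add(tmId);
        }
    }

    // Called by the RM when the TM's container is reported as completed.
    void notifyTaskManagerFailed(String tmId) {
        if (announced.remove(tmId)) {
            // The JM never saw this TM, but it knows the slot will not arrive
            // and could request a replacement instead of waiting for a timeout.
            lostBeforeConnecting.add(tmId);
        } else {
            connected.remove(tmId);
        }
    }

    List<String> getLostBeforeConnecting() {
        return lostBeforeConnecting;
    }

    boolean isConnected(String tmId) {
        return connected.contains(tmId);
    }
}

// Hypothetical stand-in for the RM side; this is NOT YarnResourceManager.
class SketchResourceManager {
    private final SketchJobManager jobManager;
    // TMs whose slots have been assigned to the JM.
    private final Set<String> assigned = new HashSet<>();

    SketchResourceManager(SketchJobManager jobManager) {
        this.jobManager = jobManager;
    }

    void assignSlot(String tmId) {
        // Announce the TM first, then assign the slot.
        jobManager.notifyPendingTaskManager(tmId);
        assigned.add(tmId);
    }

    void onContainerCompleted(String tmId) {
        // Forward the failure only for TMs the JM was told about.
        if (assigned.remove(tmId)) {
            jobManager.notifyTaskManagerFailed(tmId);
        }
    }
}
```

    In this sketch, calling `assignSlot("tm-1")` followed by 
`onContainerCompleted("tm-1")` leaves `tm-1` in the JM's 
lost-before-connecting list, which is the case neither PR currently handles.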


> YarnResourceManager sometimes does not request new Containers
> -------------------------------------------------------------
>
>                 Key: FLINK-9190
>                 URL: https://issues.apache.org/jira/browse/FLINK-9190
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.8.3
> ZooKeeper 3.4.5
> Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>         Attachments: yarn-logs
>
>
> *Description*
> The {{YarnResourceManager}} does not request new containers if 
> {{TaskManagers}} are killed rapidly in succession. After 5 minutes the job is 
> restarted due to {{NoResourceAvailableException}}, and the job runs normally 
> afterwards. I suspect that {{TaskManager}} failures are not registered if the 
> failure occurs before the {{TaskManager}} registers with the master. Logs are 
> attached; I added additional log statements to 
> {{YarnResourceManager.onContainersCompleted}} and 
> {{YarnResourceManager.onContainersAllocated}}.
> *Expected Behavior*
> The {{YarnResourceManager}} should recognize that the container is completed 
> and keep requesting new containers. The job should run as soon as resources 
> are available. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
