[ 
https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006705#comment-17006705
 ] 

Xintong Song commented on FLINK-15456:
--------------------------------------

I think the problem in FLINK-13554 is that, in the time between a container 
being allocated and the task manager successfully registering, Flink relies on 
Yarn to report container failures. If Yarn is not aware of the abnormality 
(e.g., the TM process is stuck somewhere and does not terminate), Flink does 
not take any action. To solve this, I think Flink should have a timeout for 
starting TMs in containers, and should handle the abnormality if a TM does not 
register within a reasonable time.
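
A rough sketch of the timeout idea, just to make the proposal concrete. The 
class and method names here are hypothetical and are not existing Flink or 
Yarn APIs; the sketch only assumes that the RM can record when a container 
was allocated and can periodically check which containers still have no 
registered TM.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of Flink): tracks containers that were
 * allocated but whose TM has not registered yet.
 */
final class PendingContainerTracker {

    private final long registrationTimeoutMillis;

    // containerId -> timestamp (ms) at which the container was allocated
    private final Map<String, Long> pending = new LinkedHashMap<>();

    PendingContainerTracker(long registrationTimeoutMillis) {
        this.registrationTimeoutMillis = registrationTimeoutMillis;
    }

    /** Called when the cluster manager hands out a container. */
    void onContainerAllocated(String containerId, long nowMillis) {
        pending.put(containerId, nowMillis);
    }

    /** Called when the TM running in that container registers at the RM. */
    void onTaskManagerRegistered(String containerId) {
        pending.remove(containerId);
    }

    /**
     * Removes and returns the containers whose TM did not register within
     * the timeout. The caller would release them and request replacements.
     */
    List<String> removeExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= registrationTimeoutMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

With something like this, a stuck TM process that Yarn never reports would 
still show up in removeExpired(...) once the timeout passes, so the RM would 
have a chance to release the container and ask for a new one.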

But yes, let's first try to reproduce this problem with debug logs and see if 
it's indeed the same problem.

> Job keeps failing on slot allocation timeout due to RM not allocating new TMs 
> for slot requests
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15456
>                 URL: https://issues.apache.org/jira/browse/FLINK-15456
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>             Fix For: 1.10.0
>
>         Attachments: jm_part.log
>
>
> As shown in the attached JM log, the job tried to start 30 TMs but only 29 
> registered. So the job fails because it cannot acquire all 30 slots needed 
> in time.
> And when the failover happens and tasks are re-scheduled, the RM will not ask 
> for new TMs even if it cannot fulfill the slot requests. So the job will keep 
> failing on slot allocation timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
