[ 
https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009644#comment-17009644
 ] 

Zhu Zhu commented on FLINK-15456:
---------------------------------

I just reproduced the issue with debug logs enabled. See the attached files 
jm.log and tm_container_07.log.
From jm.log, the RM started 28 containers but only 27 successfully registered 
back. The pending one, container_e14_1578278362819_0013_29_000007, failed to 
find the RM leader on ZK, so it did not register with the RM (this is 
intentionally triggered in this stability test).
So I think it is the issue described in FLINK-13554.
[~xintongsong] would you help to confirm it? If so, we can downgrade it to 
Critical so that it does not block the 1.10 release, since the problem already 
exists in previous Flink versions, but I'd still prefer to fix it in 1.10.

> Job keeps failing on slot allocation timeout due to RM not allocating new TMs 
> for slot requests
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15456
>                 URL: https://issues.apache.org/jira/browse/FLINK-15456
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>             Fix For: 1.10.0
>
>         Attachments: jm.log, jm_part.log, jm_part2.log, tm_container_07.log
>
>
> As in the attached JM log, the job tried to start 30 TMs but only 29 
> registered. So the job fails because it cannot acquire all 30 slots it needs 
> in time.
> And when failover happens and tasks are re-scheduled, the RM will not ask 
> for new TMs even if it cannot fulfill the slot requests. So the job keeps 
> failing on slot allocation timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
