[ https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006719#comment-17006719 ]
Zhu Zhu commented on FLINK-15456:
---------------------------------
Thanks for the explanation [~xintongsong].
I still have one question about this issue. In {{jm_part2.log}}, I found that
the job recovered after a failover triggered by a loss of RM leadership (around
03:08:44). After that, the RM did request a new TM for the slot requests, which
allowed the job to recover. Does that mean the pending TM was abandoned in this
case?
> Job keeps failing on slot allocation timeout due to RM not allocating new TMs
> for slot requests
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-15456
> URL: https://issues.apache.org/jira/browse/FLINK-15456
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: Zhu Zhu
> Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: jm_part.log, jm_part2.log
>
>
> As shown in the attached JM log, the job tried to start 30 TMs but only 29
> registered, so the job failed because it could not acquire all 30 required
> slots in time.
> When the failover happens and tasks are re-scheduled, the RM does not ask
> for new TMs even if it cannot fulfill the slot requests, so the job keeps
> failing with slot allocation timeouts.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)