[ https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006738#comment-17006738 ]
Xintong Song commented on FLINK-15456:
--------------------------------------
When the Flink RM recovers containers of a previous attempt from Yarn after a
failover, it does not create pending slots for them, unlike when it requests
new TM containers. The RM only adds the recovered containers' information to
its worker map, so that later TM registrations can be accepted. Existing TMs
proactively register with the new leader RM. This means that a TM from a
recovered container that never registers with the RM will not prevent the RM
from allocating new slots.
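
To make the bookkeeping described above concrete, here is a minimal sketch in
Java. It is a toy model and not Flink's actual ResourceManager code: the class
ResourceManagerSketch and all of its methods are hypothetical names chosen for
illustration, and the way missing slots are computed is an assumption. It only
shows that a recovered container is recorded in the worker map without pending
slots, so it does not reduce the number of new containers the RM would request.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the described bookkeeping; NOT Flink's real ResourceManager. */
class ResourceManagerSketch {
    private final int slotsPerWorker;
    // Worker map: container id -> pending slots still expected from that container.
    private final Map<String, Integer> pendingSlotsByWorker = new HashMap<>();
    private final Set<String> registeredWorkers = new HashSet<>();
    private int nextContainerId = 0;

    ResourceManagerSketch(int slotsPerWorker) {
        this.slotsPerWorker = slotsPerWorker;
    }

    /** Requesting a new container records the worker AND creates pending slots. */
    String requestNewContainer() {
        String id = "new-container-" + nextContainerId++;
        pendingSlotsByWorker.put(id, slotsPerWorker);
        return id;
    }

    /** Recovering a container only records the worker; no pending slots are created. */
    void recoverContainer(String id) {
        pendingSlotsByWorker.put(id, 0);
    }

    /** A TM registration is accepted only if its container is in the worker map. */
    boolean registerTaskManager(String id) {
        if (!pendingSlotsByWorker.containsKey(id)) {
            return false;
        }
        pendingSlotsByWorker.put(id, 0); // pending slots become registered slots
        registeredWorkers.add(id);
        return true;
    }

    int pendingSlots() {
        return pendingSlotsByWorker.values().stream().mapToInt(Integer::intValue).sum();
    }

    int registeredSlots() {
        return registeredWorkers.size() * slotsPerWorker;
    }

    /** Containers to request so that registered plus pending slots cover the demand. */
    int containersToRequest(int requiredSlots) {
        int missing = requiredSlots - registeredSlots() - pendingSlots();
        if (missing <= 0) {
            return 0;
        }
        return (missing + slotsPerWorker - 1) / slotsPerWorker; // round up
    }
}
```

In this model, calling recoverContainer for a TM that never registers leaves
pendingSlots() unchanged, so containersToRequest still reports the full
shortfall. Only requestNewContainer adds pending slots that can satisfy
outstanding slot requests.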
> Job keeps failing on slot allocation timeout due to RM not allocating new TMs
> for slot requests
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-15456
> URL: https://issues.apache.org/jira/browse/FLINK-15456
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: Zhu Zhu
> Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: jm_part.log, jm_part2.log
>
>
> As shown in the attached JM log, the job tried to start 30 TMs but only 29
> registered, so the job fails because it cannot acquire all 30 required slots
> in time.
> When the failover happens and tasks are re-scheduled, the RM does not request
> new TMs even though it cannot fulfill the slot requests, so the job keeps
> failing with slot allocation timeouts.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)