[
https://issues.apache.org/jira/browse/FLINK-23216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann updated FLINK-23216:
----------------------------------
Issue Type: Improvement (was: Bug)
> RM keeps allocating and freeing slots after a TM lost until its heartbeat
> timeout
> ---------------------------------------------------------------------------------
>
> Key: FLINK-23216
> URL: https://issues.apache.org/jira/browse/FLINK-23216
> Project: Flink
> Issue Type: Improvement
> Affects Versions: 1.13.1
> Reporter: Gen Luo
> Priority: Major
>
> In Flink 1.13, it's observed that the ResourceManager keeps allocating and
> freeing slots with a new TM when it's notified by yarn that a TM is lost. The
> behavior will continue until JM marks the TM as FAILED when its heartbeat
> timeout is reached. It can be easily reproduced by enlarging the
> akka.ask.timeout and heartbeat.timeout, for example to 10 min.
>
> After tracking, we find the procedure should be like this:
> When a TM is killed, yarn will first receive the event and notify the RM.
> In Flink 1.13, RM uses declarative resource management to manage the slots.
> It will find a lack of resources when receiving the notification, and then
> request a new TM from yarn.
> RM will then require the new TM to connect and offer slots to JM.
> But from JM's point of view, all slots are fulfilled, since the lost TM is
> not considered disconnected yet, until the heartbeat timeout is reached, so
> JM will reject all slot offers.
> The new TM will find no slot serving for the JM, then disconnect from the JM.
> RM will then find a lack of resources again and go back to step3, requiring
> the new TM to connect and offer slots to JM, but It won't request another new
> TM from yarn.
>
> The original log is lost but is like this:
> o.a.f.r.r.s.DefaultSlotStatusSyncer - Freeing slot xxx.
> ...(repeat serval lines for different slots)...
> o.a.f.r.r.s.DefaultSlotStatusSyncer - Starting allocation of slot xxx from
> container_xxx for job xxx.
> ...(repeat serval lines for different slots)...
>
> This could be fixed in several ways, such as notifying JM as well the RM
> receives a TM lost notification, TMs do not offer slots until required, etc.
> But all these ways have side effects so may need further discussion.
> Besides, this should no longer be an issue after FLINK-23209 is done.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)