Gen Luo created FLINK-23216:
-------------------------------
Summary: RM keeps allocating and freeing slots after a TM lost
until its heartbeat timeout
Key: FLINK-23216
URL: https://issues.apache.org/jira/browse/FLINK-23216
Project: Flink
Issue Type: Bug
Affects Versions: 1.13.1
Reporter: Gen Luo
In Flink 1.13, it's observed that the ResourceManager keeps allocating and
freeing slots with a new TM when it is notified by yarn that a TM is lost. This
behavior continues until the JM marks the lost TM as FAILED once its heartbeat
timeout is reached. It can be easily reproduced by enlarging akka.ask.timeout
and heartbeat.timeout, for example to 10 min.
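A minimal flink-conf.yaml sketch for reproducing it (the two keys are real
Flink 1.13 options; the exact values are just the 10-minute example above):

# stretch both timeouts so the window before FAILED is easy to observe
akka.ask.timeout: 10 min
heartbeat.timeout: 600000    # milliseconds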
After tracking, we find the procedure is like this:
1. When a TM is killed, yarn first receives the event and notifies the RM.
2. In Flink 1.13, the RM uses declarative resource management to manage the
slots. It finds a lack of resources when receiving the notification, and then
requests a new TM from yarn.
3. The RM then requires the new TM to connect and offer its slots to the JM.
4. From the JM's point of view, however, all slots are still fulfilled, since
the lost TM is not considered disconnected until its heartbeat timeout is
reached, so the JM rejects all slot offers.
5. The new TM finds no slot serving the JM, then disconnects from the JM.
6. The RM then finds a lack of resources again and goes back to step 3,
requiring the new TM to connect and offer slots to the JM, but it won't request
another new TM from yarn.
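To make the loop concrete, here is a small self-contained Java model of steps
3-6; all names are hypothetical stand-ins for illustration, not Flink's actual
classes or APIs:

// Hypothetical model of the allocate/free loop; not Flink's real API.
public class SlotLoopModel {
    public static void main(String[] args) {
        boolean jmHeartbeatTimedOut = false; // flips once heartbeat.timeout fires
        boolean resourcesMissing = true;     // RM's view after yarn reports TM lost
        int round = 0;
        while (resourcesMissing) {
            round++;
            // Step 3: RM makes the new TM offer its slots to the JM.
            // Step 4: JM rejects the offers while its slots still look fulfilled.
            boolean offersAccepted = jmHeartbeatTimedOut;
            if (!offersAccepted) {
                // Steps 5/6: TM disconnects, RM frees the slots and retries.
                System.out.println("round " + round + ": offers rejected, slots freed");
            } else {
                System.out.println("round " + round + ": offers accepted, loop ends");
                resourcesMissing = false;
            }
            if (round == 3) {
                jmHeartbeatTimedOut = true; // JM finally marks the lost TM FAILED
            }
        }
    }
}

Until the timeout flips, every round frees and re-allocates the same slots,
which matches the repeated log lines below.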
The original log is lost, but it looked like this:
o.a.f.r.r.s.DefaultSlotStatusSyncer - Freeing slot xxx.
...(repeated several times for different slots)...
o.a.f.r.r.s.DefaultSlotStatusSyncer - Starting allocation of slot xxx from
container_xxx for job xxx.
...(repeated several times for different slots)...
This could be fixed in several ways, such as notifying the JM as well when the
RM receives a TM-lost notification, or having TMs not offer slots until
required, etc. But all of these ways have side effects, so they may need
further discussion.
Besides, this should no longer be an issue after FLINK-23209 is done.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)