[ https://issues.apache.org/jira/browse/FLINK-23216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weijie Guo updated FLINK-23216:
-------------------------------
    Affects Version/s: 2.1.0
                           (was: 1.14.0)
                           (was: 1.13.1)
                           (was: 1.12.4)

> RM keeps allocating and freeing slots after a TM lost until its heartbeat timeout
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-23216
>                 URL: https://issues.apache.org/jira/browse/FLINK-23216
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 2.1.0
>            Reporter: Gen Luo
>            Priority: Major
>             Fix For: 2.0.0
>
>
> In Flink 1.13, we observed that when YARN notifies the ResourceManager (RM)
> that a TaskManager (TM) is lost, the RM keeps allocating and freeing slots
> on a new TM. This continues until the JobMaster (JM) marks the lost TM as
> FAILED once its heartbeat timeout is reached. The behavior is easy to
> reproduce by enlarging akka.ask.timeout and heartbeat.timeout, for example
> to 10 min.
>
> After tracking it down, we believe the procedure is as follows:
> 1. When a TM is killed, YARN receives the event first and notifies the RM.
> 2. In Flink 1.13, the RM uses declarative resource management to manage the
>    slots. On receiving the notification it finds a lack of resources and
>    requests a new TM from YARN.
> 3. The RM then requires the new TM to connect to the JM and offer slots.
> 4. From the JM's point of view, all slots are already fulfilled, because the
>    lost TM is not considered disconnected until the heartbeat timeout is
>    reached. The JM therefore rejects all slot offers.
> 5. The new TM finds that none of its slots serve the JM and disconnects from
>    the JM.
> 6. The RM finds a lack of resources again and goes back to step 3, requiring
>    the new TM to connect and offer slots to the JM. It does not request
>    another new TM from YARN.
>
> The original log is lost, but it looked like this:
> o.a.f.r.r.s.DefaultSlotStatusSyncer - Freeing slot xxx.
> ...(repeated for several lines, one per slot)...
> o.a.f.r.r.s.DefaultSlotStatusSyncer - Starting allocation of slot xxx from container_xxx for job xxx.
> ...(repeated for several lines, one per slot)...
>
> This could be fixed in several ways, such as notifying the JM as well when
> the RM receives a TM-lost notification, or having TMs not offer slots until
> required. All of these have side effects, so they may need further
> discussion.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
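
The allocate/free cycle described in the issue can be illustrated with a toy simulation. This is only a sketch of the reported behavior, not Flink code: `JobMasterView`, `simulate`, the tick-based clock, and the "allocate"/"free" event names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobMasterView:
    """Toy stand-in for the JM's view of its slots (hypothetical, not a Flink class)."""
    required_slots: int
    heartbeat_timeout: int  # tick at which the lost TM is marked FAILED
    slots_on_lost_tm: int
    slots_on_new_tm: int = 0

    def expire_lost_tm(self, now: int) -> None:
        # Until the heartbeat times out, the JM still counts the lost TM's slots.
        if now >= self.heartbeat_timeout:
            self.slots_on_lost_tm = 0

    def offer(self, offered: int) -> int:
        """Return how many of the offered slots the JM accepts."""
        missing = self.required_slots - self.slots_on_lost_tm - self.slots_on_new_tm
        accepted = max(0, min(offered, missing))
        self.slots_on_new_tm += accepted
        return accepted


def simulate(heartbeat_timeout: int, required_slots: int = 2, max_ticks: int = 100):
    """Return the (tick, event) pairs produced by the toy RM."""
    jm = JobMasterView(required_slots, heartbeat_timeout,
                       slots_on_lost_tm=required_slots)
    events = []
    for now in range(max_ticks):
        jm.expire_lost_tm(now)
        # The RM already knows the old TM is gone, so it sees a shortage
        # and allocates slots on the new TM for this job.
        events.append((now, "allocate"))
        accepted = jm.offer(required_slots)
        if accepted == required_slots:
            break  # the JM took the slots, so the cycle ends
        # All offers were rejected: the new TM disconnects and the RM frees the slots.
        events.append((now, "free"))
    return events


if __name__ == "__main__":
    for tick, event in simulate(heartbeat_timeout=3):
        print(tick, event)
```

In this sketch the number of wasted allocate/free rounds grows linearly with the heartbeat timeout, which matches the observation that enlarging heartbeat.timeout makes the problem easy to reproduce.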