Gen Luo created FLINK-23216:
-------------------------------
Summary: RM keeps allocating and freeing slots after a TM lost
until its heartbeat timeout
Key: FLINK-23216
URL: https://issues.apache.org/jira/browse/FLINK-23216
Project: Flink
Issue Type: Bug
Affects Versions: 1.13.1
Reporter: Gen Luo
In Flink 1.13, it's observed that the ResourceManager keeps allocating and
freeing slots with a new TM when it is notified by yarn that a TM is lost. This
behavior continues until the JM marks the lost TM as FAILED once its heartbeat
timeout is reached. It can be easily reproduced by enlarging akka.ask.timeout
and heartbeat.timeout, for example to 10 min.
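A minimal flink-conf.yaml sketch for reproducing it (the two keys are real
Flink 1.13 options; the exact values are just the 10-minute example above):

# stretch both timeouts so the window before FAILED is easy to observe
akka.ask.timeout: 10 min
heartbeat.timeout: 600000    # milliseconds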
After tracking, we find the procedure is like this:
1. When a TM is killed, yarn first receives the event and notifies the RM.
2. In Flink 1.13, the RM uses declarative resource management to manage the
slots. It finds a lack of resources when receiving the notification, and then
requests a new TM from yarn.
3. The RM then requires the new TM to connect and offer its slots to the JM.
4. From the JM's point of view, however, all slots are still fulfilled, since
the lost TM is not considered disconnected until its heartbeat timeout is
reached, so the JM rejects all slot offers.
5. The new TM finds no slot serving the JM, then disconnects from the JM.
6. The RM then finds a lack of resources again and goes back to step 3,
requiring the new TM to connect and offer slots to the JM, but it won't request
another new TM from yarn.
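To make the loop concrete, here is a small self-contained Java model of steps
3-6; all names are hypothetical stand-ins for illustration, not Flink's actual
classes or APIs:

// Hypothetical model of the allocate/free loop; not Flink's real API.
public class SlotLoopModel {
    public static void main(String[] args) {
        boolean jmHeartbeatTimedOut = false; // flips once heartbeat.timeout fires
        boolean resourcesMissing = true;     // RM's view after yarn reports TM lost
        int round = 0;
        while (resourcesMissing) {
            round++;
            // Step 3: RM makes the new TM offer its slots to the JM.
            // Step 4: JM rejects the offers while its slots still look fulfilled.
            boolean offersAccepted = jmHeartbeatTimedOut;
            if (!offersAccepted) {
                // Steps 5/6: TM disconnects, RM frees the slots and retries.
                System.out.println("round " + round + ": offers rejected, slots freed");
            } else {
                System.out.println("round " + round + ": offers accepted, loop ends");
                resourcesMissing = false;
            }
            if (round == 3) {
                jmHeartbeatTimedOut = true; // JM finally marks the lost TM FAILED
            }
        }
    }
}

Until the timeout flips, every round frees and re-allocates the same slots,
which matches the repeated log lines below.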
The original log is lost, but it looked like this:
o.a.f.r.r.s.DefaultSlotStatusSyncer - Freeing slot xxx.
...(repeated several times for different slots)...
o.a.f.r.r.s.DefaultSlotStatusSyncer - Starting allocation of slot xxx from
container_xxx for job xxx.
...(repeated several times for different slots)...
This could be fixed in several ways, such as notifying the JM as well when the
RM receives a TM-lost notification, or having TMs not offer slots until
required, etc. But all of these ways have side effects, so they may need
further discussion.
Besides, this should no longer be an issue after FLINK-23209 is done.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)