[jira] [Updated] (FLINK-23216) RM keeps allocating and freeing slots after a TM lost until its heartbeat timeout

Till Rohrmann (Jira) Fri, 02 Jul 2021 08:20:08 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-23216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Till Rohrmann updated FLINK-23216:
----------------------------------
    Issue Type: Improvement  (was: Bug)

> RM keeps allocating and freeing slots after a TM lost until its heartbeat 
> timeout
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-23216
>                 URL: https://issues.apache.org/jira/browse/FLINK-23216
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.13.1
>            Reporter: Gen Luo
>            Priority: Major
>
> In Flink 1.13, it's observed that the ResourceManager keeps allocating and 
> freeing slots with a new TM when it's notified by yarn that a TM is lost. The 
> behavior will continue until JM marks the TM as FAILED when its heartbeat 
> timeout is reached. It can be easily reproduced by enlarging the 
> akka.ask.timeout and heartbeat.timeout, for example to 10 min.
>  
> After tracking, we find the procedure should be like this:
> When a TM is killed, yarn will first receive the event and notify the RM.
> In Flink 1.13, RM uses declarative resource management to manage the slots. 
> It will find a lack of resources when receiving the notification, and then 
> request a new TM from yarn.
> RM will then require the new TM to connect and offer slots to JM.
> But from JM's point of view, all slots are fulfilled, since the lost TM is 
> not considered disconnected yet, until the heartbeat timeout is reached, so 
> JM will reject all slot offers.
> The new TM will find no slot serving for the JM, then disconnect from the JM.
> RM will then find a lack of resources again and go back to step3, requiring 
> the new TM to connect and offer slots to JM, but It won't request another new 
> TM from yarn.
>  
> The original log is lost but is like this:
> o.a.f.r.r.s.DefaultSlotStatusSyncer - Freeing slot xxx.
> ...(repeat serval lines for different slots)...
> o.a.f.r.r.s.DefaultSlotStatusSyncer - Starting allocation of slot xxx from 
> container_xxx for job xxx.
> ...(repeat serval lines for different slots)...
>  
> This could be fixed in several ways, such as notifying JM as well the RM 
> receives a TM lost notification, TMs do not offer slots until required, etc. 
> But all these ways have side effects so may need further discussion. 
> Besides, this should no longer be an issue after FLINK-23209 is done.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (FLINK-23216) RM keeps allocating and freeing slots after a TM lost until its heartbeat timeout

Reply via email to