[ 
https://issues.apache.org/jira/browse/FLINK-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-21751:
-----------------------------------
    Labels: pull-request-available  (was: )

> Improve handling of freed slots if final requirement message is in flight
> -------------------------------------------------------------------------
>
>                 Key: FLINK-21751
>                 URL: https://issues.apache.org/jira/browse/FLINK-21751
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Chesnay Schepler
>            Assignee: Chesnay Schepler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
>
> When a job shuts down, there is a race condition between slots being freed 
> and requirements being set to 0. If the slot release arrives at the RM 
> first, the RM will immediately try to re-allocate slots, since the 
> requirements are not 0 yet.
> In practice this is unlikely to cause issues (because the trip from 
> JobMaster->TM->RM should always take longer than JobMaster->RM), but this 
> problem results in various test instabilities.
> Essentially there are 2 alternatives:
> a) enforce a strict order such that the requirement update must be 
> acknowledged before slots are freed
> b) have the RM inform the TM if the job has finished, to clean up any pending 
> slots.
> Neither option is ideal.
> a) implies that the JobMaster has to stick around longer to wait for the 
> acknowledgement, and this also introduces a delay to all slot freeing 
> operations.
> b) can easily lead to bugs in the future: if the TM is informed that the job 
> has concluded, it must only cancel pending slots; it must not free all job 
> resources, because other messages from the JM may still be in flight (for 
> example, the partition promotions).
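
The race described above can be illustrated with a toy model. This is only a
sketch and not Flink code: ToyResourceManager, on_requirements, on_slot_freed
and run are hypothetical names, and the model assumes a job that holds two
slots and an RM that re-allocates as soon as fewer slots are allocated than
required.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for the RM; not Flink's actual class."""

    required_slots: int
    allocated_slots: int
    reallocations: list = field(default_factory=list)

    def on_requirements(self, required: int) -> None:
        # JobMaster -> RM: declare how many slots the job still needs.
        self.required_slots = required

    def on_slot_freed(self) -> None:
        # TM -> RM: the job released one slot.
        self.allocated_slots -= 1
        missing = self.required_slots - self.allocated_slots
        if missing > 0:
            # Requirements still look unfulfilled, so the RM re-allocates.
            self.allocated_slots += missing
            self.reallocations.append(missing)


def run(order):
    """Deliver the messages in the given order and return the re-allocations."""
    rm = ToyResourceManager(required_slots=2, allocated_slots=2)
    for message in order:
        if message == "requirements=0":
            rm.on_requirements(0)
        else:
            rm.on_slot_freed()
    return rm.reallocations


# Slot releases overtake the final requirement message: the RM re-allocates.
racy = run(["slot_freed", "slot_freed", "requirements=0"])

# Requirement update is processed first (the order option a) would enforce).
ordered = run(["requirements=0", "slot_freed", "slot_freed"])
```

In this model the first ordering records one re-allocation per freed slot,
while the second records none. The sketch says nothing about the cost of
option a), namely the extra wait for the acknowledgement.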



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
