[ 
https://issues.apache.org/jira/browse/FLINK-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chesnay Schepler updated FLINK-21751:
-------------------------------------
    Description: 
When a job shuts down there is a race condition between slots being freed and 
requirements being set to 0. If the slot release arrives first at the RM then 
it will immediately try to re-allocate slots, since the requirements are not 0 
yet.
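
To make the ordering problem concrete, here is a minimal sketch of the race. It is a toy model, not Flink code: ResourceManagerModel, its fields, and the message tuples are hypothetical names chosen for illustration, and it assumes the RM tries to re-fill a slot as soon as the allocated count drops below the declared requirement.

```python
from dataclasses import dataclass


@dataclass
class ResourceManagerModel:
    """Toy stand-in for the RM's per-job bookkeeping (hypothetical, not Flink code)."""
    required_slots: int
    allocated_slots: int
    reallocations: int = 0

    def on_requirements_update(self, required: int) -> None:
        # The JobMaster declares how many slots the job still needs.
        self.required_slots = required

    def on_slot_freed(self) -> None:
        self.allocated_slots -= 1
        # If the job still appears to need slots, re-fill the gap right away.
        if self.allocated_slots < self.required_slots:
            self.allocated_slots += 1
            self.reallocations += 1


def run(messages):
    """Replay messages in arrival order and count needless re-allocations."""
    rm = ResourceManagerModel(required_slots=2, allocated_slots=2)
    for kind, payload in messages:
        if kind == "requirements":
            rm.on_requirements_update(payload)
        else:
            rm.on_slot_freed()
    return rm.reallocations


# The requirement update arrives first: nothing is re-allocated.
in_order = [("requirements", 0), ("freed", None), ("freed", None)]
# A slot release overtakes the requirement update: one slot is re-allocated.
racy = [("freed", None), ("requirements", 0), ("freed", None)]
```

In this model run(in_order) performs no re-allocation, while run(racy) re-allocates one slot for a job that is already shutting down.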

In practice this is unlikely to cause issues (because the trip from 
JobMaster->TM->RM should always take longer than JobMaster->RM), but this 
problem results in various test instabilities.

Essentially there are 2 alternatives:
a) enforce a strict order such that the requirement update must be acknowledged 
before slots are freed
b) have the RM inform the TM if the job has finished, to clean up any pending 
slots.

Neither option is ideal.
a) implies that the JobMaster has to stick around longer to wait for the 
acknowledgement, and it also delays all slot freeing operations.
b) can easily lead to bugs in the future; if the TM is informed that the job 
has concluded it must only cancel pending slots; it must not free all job 
resources because other messages from the JM may still be in flight (for 
example, the partition promotions).
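
As a rough illustration of alternative a), the sketch below makes the shutdown path wait for the acknowledgement before freeing any slot. It is only a sketch under assumptions: shut_down_job, declare_requirements, and free_slot are hypothetical names, not Flink APIs, and the RM round trip is simulated with a short sleep.

```python
import asyncio


async def shut_down_job(declare_requirements, free_slot, slot_ids):
    # Alternative a): wait until the RM has acknowledged requirements == 0 ...
    await declare_requirements(0)
    # ... and only then release the slots, so a release can never overtake
    # the requirement update.
    for slot_id in slot_ids:
        free_slot(slot_id)


def demo():
    """Record the order of events for a two-slot job (hypothetical setup)."""
    events = []

    async def declare_requirements(required):
        events.append(f"requirements={required} sent")
        await asyncio.sleep(0.01)  # simulated round trip to the RM
        events.append("ack received")

    def free_slot(slot_id):
        events.append(f"freed {slot_id}")

    asyncio.run(shut_down_job(declare_requirements, free_slot, ["slot-1", "slot-2"]))
    return events
```

The await is also where the cost of a) shows up: every slot release is held back by one round trip to the RM, and the JobMaster has to stay alive for that long.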

> Improve handling of freed slots if final requirement message is in flight
> -------------------------------------------------------------------------
>
>                 Key: FLINK-21751
>                 URL: https://issues.apache.org/jira/browse/FLINK-21751
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Chesnay Schepler
>            Assignee: Chesnay Schepler
>            Priority: Major
>             Fix For: 1.13.0
>
>
> When a job shuts down there is a race condition between slots being freed and 
> requirements being set to 0. If the slot release arrives first at the RM then 
> it will immediately try to re-allocate slots, since the requirements are not 
> 0 yet.
> In practice this is unlikely to cause issues (because the trip from 
> JobMaster->TM->RM should always take longer than JobMaster->RM), but this 
> problem results in various test instabilities.
> Essentially there are 2 alternatives:
> a) enforce a strict order such that the requirement update must be 
> acknowledged before slots are freed
> b) have the RM inform the TM if the job has finished, to clean up any pending 
> slots.
> Neither option is ideal.
> a) implies that the JobMaster has to stick around longer to wait for the 
> acknowledgement, and it also delays all slot freeing operations.
> b) can easily lead to bugs in the future; if the TM is informed that the job 
> has concluded it must only cancel pending slots; it must not free all job 
> resources because other messages from the JM may still be in flight (for 
> example, the partition promotions).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
