[
https://issues.apache.org/jira/browse/FLINK-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-21751:
-----------------------------------
Labels: pull-request-available (was: )
> Improve handling of freed slots if final requirement message is in flight
> -------------------------------------------------------------------------
>
> Key: FLINK-21751
> URL: https://issues.apache.org/jira/browse/FLINK-21751
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: Chesnay Schepler
> Assignee: Chesnay Schepler
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0
>
>
> When a job shuts down there is a race condition between slots being freed and
> requirements being set to 0. If the slot release arrives first at the RM then
> it will immediately try to re-allocate slots, since the requirements are not
> 0 yet.
> In practice this is unlikely to cause issues (because the trip from
> JobMaster->TM->RM should always take longer than JobMaster->RM), but this
> problem results in various test instabilities.
> Essentially there are 2 alternatives:
> a) enforce a strict order such that the requirement update must be
> acknowledged before slots are freed
> b) have the RM inform the TM if the job has finished, to clean up any pending
> slots.
> Neither option is ideal.
> a) implies that the JobMaster has to stick around longer to wait for the
> acknowledgement, and this also introduces a delay to all slot freeing operations.
> b) can easily lead to bugs in the future; if the TM is informed that the job
> has concluded, it must only cancel pending slots; it must not free all job
> resources, because other messages from the JM may still be in flight (for
> example, the partition promotions).