[ 
https://issues.apache.org/jira/browse/FLINK-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380252#comment-17380252
 ] 

Xintong Song commented on FLINK-21751:
--------------------------------------

Another instance:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20396&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=2390

[~chesnay], do you think the {{RecipientUnreachableException}} is still 
expected even with this ticket taking effect? If yes, we can simply add it to 
the log checking whitelist.

> Improve handling of freed slots if final requirement message is in flight
> -------------------------------------------------------------------------
>
>                 Key: FLINK-21751
>                 URL: https://issues.apache.org/jira/browse/FLINK-21751
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Chesnay Schepler
>            Assignee: Chesnay Schepler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
>
> When a job shuts down there is a race condition between slots being freed and 
> requirements being set to 0. If the slot release arrives first at the RM then 
> it will immediately try to re-allocate slots, since the requirements are not 
> 0 yet.
> In practice this is unlikely to cause issues (because the trip from 
> JobMaster->TM->RM should always take longer than JobMaster->RM), but this 
> problem results in various test instabilities.
> Essentially there are 2 alternatives:
> a) enforce a strict order such that the requirement update must be 
> acknowledged before slots are freed
> b) have the RM inform the TM if the job has finished, to clean up any pending 
> slots.
> Neither option is ideal.
> a) implies that the JobMaster has to stick around longer to wait for the 
> acknowledgement, and this also introduces a delay to all slot freeing operations.
> b) can easily lead to bugs in the future; if the TM is informed that the job 
> has concluded, it must only cancel pending slots; it must not free all job 
> resources because other messages from the JM may still be in flight (for 
> example, the partition promotions).
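
The race in the description above can be sketched with a toy model. This is
only an illustration under stated assumptions: ToyResourceManager,
notifySlotFreed and setRequirements are hypothetical names, not Flink classes
or methods, and the model tracks a single job with a single slot. It shows why
the order in which the two messages reach the RM decides whether the freed
slot is requested again.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the race; none of these names are Flink classes. */
public class SlotRaceSketch {

    /** Hypothetical stand-in for the RM's view of a single job. */
    static final class ToyResourceManager {
        private int requiredSlots;
        private int allocatedSlots;
        private final List<String> log = new ArrayList<>();

        ToyResourceManager(int requiredSlots) {
            this.requiredSlots = requiredSlots;
            this.allocatedSlots = requiredSlots;
        }

        /** Models JobMaster -> RM: the final requirement update. */
        void setRequirements(int slots) {
            requiredSlots = slots;
            log.add("requirements=" + slots);
        }

        /** Models JobMaster -> TM -> RM: a slot was freed. */
        void notifySlotFreed() {
            allocatedSlots--;
            if (allocatedSlots < requiredSlots) {
                // Requirements still look unfulfilled, so the slot is requested again.
                allocatedSlots++;
                log.add("re-allocate");
            } else {
                log.add("freed");
            }
        }

        List<String> log() {
            return log;
        }
    }

    /** The slot release overtakes the requirement update: the problematic order. */
    static List<String> releaseFirst() {
        ToyResourceManager rm = new ToyResourceManager(1);
        rm.notifySlotFreed();
        rm.setRequirements(0);
        return rm.log();
    }

    /** Alternative a): the requirement update is processed before the slot is freed. */
    static List<String> requirementsFirst() {
        ToyResourceManager rm = new ToyResourceManager(1);
        rm.setRequirements(0);
        rm.notifySlotFreed();
        return rm.log();
    }

    public static void main(String[] args) {
        System.out.println(releaseFirst());      // [re-allocate, requirements=0]
        System.out.println(requirementsFirst()); // [requirements=0, freed]
    }
}
```

The sketch only covers message ordering; it does not model the acknowledgement
round trip that makes alternative a) slower, nor the TM-side cleanup of
alternative b).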



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
