[jira] [Commented] (FLINK-12863) Race condition between slot offerings and AllocatedSlotReport

Xiaogang Shi (JIRA) Sun, 16 Jun 2019 19:25:15 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865239#comment-16865239
 ]


Xiaogang Shi commented on FLINK-12863:
--------------------------------------

I think a similar problem happens in the heartbeats between RM and TM. 

When the RM receives a slot request from the JM, it finds an available slot,
marks it as pending, and sends a slot request to the TM. If that slot request
is sent right after a heartbeat request, the RM receives the heartbeat response
first. That response does not yet reflect the slot request, so the RM removes
the pending slot. The RM may then reuse the slot when it receives a new slot
request from the JM, leading to duplicate slot allocation.

A solution proposed by [~yungao.gy] is to use version numbers. Each slot has a
version number that is incremented whenever a new pending request is generated
for it. These version numbers are attached to the heartbeats sent to the TM.
When a heartbeat response is received, we do not remove pending slot requests
whose version numbers are greater than those carried by the heartbeat.
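
To make the idea concrete, here is a rough sketch of the RM-side bookkeeping.
It is only an illustration, not Flink's actual SlotManager code: the class
VersionedSlotTracker, its methods, and the use of plain String slot ids are
all made up for this example, and it assumes the heartbeat response echoes the
versions that were attached to the heartbeat request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical RM-side tracker; names and types are assumptions for illustration. */
class VersionedSlotTracker {
    // slotId -> latest version handed out for that slot
    private final Map<String, Long> slotVersions = new HashMap<>();
    // slotId -> version of the pending slot request for that slot
    private final Map<String, Long> pending = new HashMap<>();

    /** Marks the slot as pending and increments its version. */
    long markPending(String slotId) {
        long version = slotVersions.merge(slotId, 1L, Long::sum);
        pending.put(slotId, version);
        return version;
    }

    /** Copy of the current versions, to be attached to an outgoing heartbeat. */
    Map<String, Long> snapshotForHeartbeat() {
        return new HashMap<>(slotVersions);
    }

    /**
     * Handles a heartbeat response. heartbeatVersions are the versions that
     * were attached to the heartbeat (assumed to be echoed back by the TM);
     * slotsReportedFree are the slots the TM reported as not allocated.
     */
    void onHeartbeatResponse(Map<String, Long> heartbeatVersions,
                             Set<String> slotsReportedFree) {
        for (String slotId : slotsReportedFree) {
            Long pendingVersion = pending.get(slotId);
            if (pendingVersion == null) {
                continue; // nothing pending for this slot
            }
            long heartbeatVersion = heartbeatVersions.getOrDefault(slotId, 0L);
            // Only drop the pending request if the heartbeat already knew about it.
            // A pending request newer than the heartbeat is kept.
            if (pendingVersion <= heartbeatVersion) {
                pending.remove(slotId);
            }
        }
    }

    boolean isPending(String slotId) {
        return pending.containsKey(slotId);
    }
}
```

With this check, a heartbeat that was sent before markPending() carries an
older version for the slot, so its response cannot clear the newer pending
request.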

I think the solution can also work here. What do you think? 
[~yungao.gy][~till.rohrmann]

> Race condition between slot offerings and AllocatedSlotReport
> -------------------------------------------------------------
>
>                 Key: FLINK-12863
>                 URL: https://issues.apache.org/jira/browse/FLINK-12863
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.7.3, 1.9.0, 1.8.1
>
>
> With FLINK-11059 we introduced the {{AllocatedSlotReport}} which is used by 
> the {{TaskExecutor}} to synchronize its internal view on slot allocations 
> with the view of the {{JobMaster}}. It seems that there is a race condition 
> between offering slots and receiving the report because the 
> {{AllocatedSlotReport}} is sent by the {{HeartbeatManagerSenderImpl}} from a 
> separate thread. 
> As a result, we can generate an {{AllocatedSlotReport}} just before new 
> slots are offered. Since the report is sent from a different thread, the 
> response to the slot offerings can then be sent earlier than the 
> {{AllocatedSlotReport}}. Consequently, we might receive an outdated slot 
> report on the {{TaskExecutor}}, causing active slots to be released.
> In order to solve the problem I propose to add a fencing token to the 
> {{AllocatedSlotReport}} which is being updated whenever we offer new slots to 
> the {{JobMaster}}. When we receive the {{AllocatedSlotReport}} on the 
> {{TaskExecutor}} we compare the current slot report fencing token with the 
> received one and only process the report if they are equal. Otherwise we wait 
> for the next heartbeat to send us an up-to-date {{AllocatedSlotReport}}.
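
For comparison, the fencing-token check proposed in the issue description
could look roughly like the following on the TaskExecutor side. This is a
hedged sketch only: FencedSlotReportHandler and its methods are hypothetical
names, slots are plain Strings, and it assumes the token is a counter that
travels with the slot offer so that the JobMaster can attach it to its next
report.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical TaskExecutor-side handler; not the actual Flink implementation. */
class FencedSlotReportHandler {
    private long currentToken = 0L;
    private final Set<String> activeSlots = new HashSet<>();

    /** Called when slots are offered to the JobMaster; returns the new fencing token. */
    long offerSlots(Set<String> offered) {
        activeSlots.addAll(offered);
        currentToken++; // the token changes with every offer
        return currentToken;
    }

    /**
     * Processes a slot report only if its token matches the current one.
     * Returns false for an outdated report, which is simply ignored until the
     * next heartbeat delivers an up-to-date one.
     */
    boolean onAllocatedSlotReport(long reportToken, Set<String> allocatedOnJobMaster) {
        if (reportToken != currentToken) {
            return false;
        }
        // Release slots the JobMaster does not know about.
        activeSlots.retainAll(allocatedOnJobMaster);
        return true;
    }

    Set<String> activeSlots() {
        return new HashSet<>(activeSlots);
    }
}
```

A report generated before the latest offer carries the previous token, so it
is skipped and the newly offered slots stay active.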



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
