[
https://issues.apache.org/jira/browse/FLINK-12863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865278#comment-16865278
]
shuai.xu commented on FLINK-12863:
----------------------------------
Hi [~till.rohrmann], as [~xiaogang.shi] said, we found the same race condition
between RM and TM, and added a version to each slot to solve it. I think
adding a fencing token to the AllocatedSlotReport can solve it. But when would
you update the fencing token: when offering slots succeeds, or before offering
slots? If it is updated when the offer succeeds, it may happen that the JM uses
the new fencing token while the TM considers the offer to have failed, so the
TM does not update the token, and the JM has no chance to use the old token any
more. If the TM updates the token before offering slots, it may happen that the
JM doesn't receive the offer, so the JM doesn't update the token. I think using
a version may be more suitable, as we can compare two versions, and the larger
version will always be the correct one.
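
To illustrate the version idea, here is a minimal sketch. The class and method
names (VersionedSlotTable, nextVersionForOffer, isReportCurrent) are
hypothetical and not part of Flink; it assumes each slot carries a
monotonically increasing version that the TM bumps on every offer, and that a
report entry carries the version the JM last saw.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical TM-side bookkeeping for per-slot versions; not Flink's actual API. */
final class VersionedSlotTable {
    // Latest version the TM has assigned to each slot id.
    private final Map<String, Long> versions = new HashMap<>();

    /** Bumps the slot's version for a new offer and returns the new version. */
    long nextVersionForOffer(String slotId) {
        return versions.merge(slotId, 1L, Long::sum);
    }

    /** A report entry is trusted only if it is at least as new as the local version. */
    boolean isReportCurrent(String slotId, long reportedVersion) {
        long local = versions.getOrDefault(slotId, 0L);
        return reportedVersion >= local;
    }
}
```

Because versions are ordered, both sides can decide which view is newer even
if one message is lost, which is not possible with a plain equality check.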
> Race condition between slot offerings and AllocatedSlotReport
> -------------------------------------------------------------
>
> Key: FLINK-12863
> URL: https://issues.apache.org/jira/browse/FLINK-12863
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Priority: Critical
> Fix For: 1.7.3, 1.9.0, 1.8.1
>
>
> With FLINK-11059 we introduced the {{AllocatedSlotReport}} which is used by
> the {{TaskExecutor}} to synchronize its internal view on slot allocations
> with the view of the {{JobMaster}}. It seems that there is a race condition
> between offering slots and receiving the report because the
> {{AllocatedSlotReport}} is sent by the {{HeartbeatManagerSenderImpl}} from a
> separate thread.
> Due to that it can happen that we generate an {{AllocatedSlotReport}} just
> before getting new slots offered. Since the report is sent from a different
> thread, it can then happen that the response to the slot offerings is sent
> earlier than the {{AllocatedSlotReport}}. Consequently, we might receive an
> outdated slot report on the {{TaskExecutor}} causing active slots to be
> released.
> In order to solve the problem I propose to add a fencing token to the
> {{AllocatedSlotReport}} which is being updated whenever we offer new slots to
> the {{JobMaster}}. When we receive the {{AllocatedSlotReport}} on the
> {{TaskExecutor}} we compare the current slot report fencing token with the
> received one and only process the report if they are equal. Otherwise, we wait
> for the next heartbeat to send us an up-to-date {{AllocatedSlotReport}}.
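
The fencing-token check proposed in the description above could look roughly
like the following sketch. SlotReportFence, renewOnOffer and shouldProcess are
hypothetical names chosen for illustration, not Flink's actual API, and the
use of a random UUID as the token is an assumption.

```java
import java.util.UUID;

/** Hypothetical TaskExecutor-side fence for slot reports; not Flink's actual API. */
final class SlotReportFence {
    // Token that a report must carry to be processed.
    private UUID currentToken = UUID.randomUUID();

    /** Called whenever new slots are offered to the JobMaster; returns the fresh token. */
    UUID renewOnOffer() {
        currentToken = UUID.randomUUID();
        return currentToken;
    }

    /** Process a report only if its token equals the current one; otherwise skip it. */
    boolean shouldProcess(UUID reportToken) {
        return currentToken.equals(reportToken);
    }
}
```

A report generated before the latest offer carries an older token, so
shouldProcess returns false and the TaskExecutor waits for the next heartbeat.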