[ https://issues.apache.org/jira/browse/FLINK-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann closed FLINK-12865.
---------------------------------
Resolution: Fixed
Fix Version/s: 1.9.0, 1.8.1, 1.7.3
Fixed via
1.9.0: a95dac57ef0e1949fd4751ca19350da96c3bf52f
1.8.1: 55c8a69cfa4d40ef2863987eb89adb08f0c45dda
1.7.3: 7333b619fdf3443e179b3f6e8d3147ab4946f91c
> State inconsistency between RM and TM on the slot status
> --------------------------------------------------------
>
> Key: FLINK-12865
> URL: https://issues.apache.org/jira/browse/FLINK-12865
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Yun Gao
> Assignee: Till Rohrmann
> Priority: Major
> Fix For: 1.7.3, 1.8.1, 1.9.0
>
>
> There may be state inconsistency between the TM and the RM due to a race
> condition and message loss:
> # When the TM sends a heartbeat, it retrieves the SlotReport in the main
> thread but sends the heartbeat in another thread. The slot on the TM may be
> FREE initially, so the SlotReport reads the FREE state; the RM then requests
> the slot and marks it as allocated, and the stale SlotReport finally
> overrides the allocated status on the RM side.
> # When the RM requests a slot, the TM receives the request but the
> acknowledgement message gets lost. The RM will then think this slot is free.
> Both problems may cause the RM to mark an ALLOCATED slot as FREE. Currently
> this may cause additional retries until the state is synchronized by the
> next heartbeat, and in the future it may lead to inaccurate resource
> statistics for fine-grained resource management.
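
The first race described above can be reproduced in a few lines. The following is only a minimal sketch, not Flink code and not the fix from the commits listed above: the class ResourceManagerView, the PENDING state, and the guardPending flag are all hypothetical names introduced for illustration. It assumes the RM tracks an in-flight request separately and, as one possible mitigation, ignores a FREE report for a slot it has just requested.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SlotStatusRaceSketch {

    // PENDING is an assumed intermediate state; the issue text only names FREE and ALLOCATED.
    enum SlotState { FREE, PENDING, ALLOCATED }

    /** Hypothetical stand-in for the RM-side slot bookkeeping; not Flink's SlotManager. */
    static class ResourceManagerView {
        private final Map<Integer, SlotState> slots = new HashMap<>();
        private final boolean guardPending;

        ResourceManagerView(boolean guardPending) {
            this.guardPending = guardPending;
        }

        void registerSlot(int slotId) {
            slots.put(slotId, SlotState.FREE);
        }

        // The RM sends a slot request to the TM and records it as in flight.
        void requestSlot(int slotId) {
            slots.put(slotId, SlotState.PENDING);
        }

        // Apply a slot report carried by a TM heartbeat.
        void applySlotReport(Map<Integer, SlotState> report) {
            for (Map.Entry<Integer, SlotState> entry : report.entrySet()) {
                SlotState current = slots.get(entry.getKey());
                boolean possiblyStale =
                        current == SlotState.PENDING && entry.getValue() == SlotState.FREE;
                if (guardPending && possiblyStale) {
                    // The report may predate the request, so keep the pending state.
                    continue;
                }
                slots.put(entry.getKey(), entry.getValue());
            }
        }

        SlotState stateOf(int slotId) {
            return slots.get(slotId);
        }
    }

    /** Replays the interleaving from the first bullet and returns the RM's final view of slot 0. */
    static SlotState staleReportRace(boolean guardPending) {
        ResourceManagerView rm = new ResourceManagerView(guardPending);
        rm.registerSlot(0);

        // 1. The TM main thread snapshots the slot report while the slot is still FREE.
        Map<Integer, SlotState> staleReport = Collections.singletonMap(0, SlotState.FREE);

        // 2. Before the heartbeat thread sends the snapshot, the RM requests the slot.
        rm.requestSlot(0);

        // 3. The stale snapshot arrives at the RM afterwards.
        rm.applySlotReport(staleReport);
        return rm.stateOf(0);
    }

    public static void main(String[] args) {
        System.out.println("naive RM view:   " + staleReportRace(false));
        System.out.println("guarded RM view: " + staleReportRace(true));
    }
}
```

Without the guard, the RM's view of the slot falls back to FREE even though a request is in flight; with the guard it stays PENDING until a later report confirms the allocation. The second case (lost acknowledgement) would be reconciled the same way in this sketch, since a later report stating ALLOCATED is always applied.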
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)