[
https://issues.apache.org/jira/browse/FLINK-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866368#comment-16866368
]
Till Rohrmann commented on FLINK-12865:
---------------------------------------
Thanks for the clarification [~gaoyunhaii]. I think you are right that we
always need to handle timeouts. On a timeout, the assumption is that
the request failed and that we need to retry. I'm not sure what else we can do
about timeouts. The only alternative I can think of is to assume that the
request succeeded, which should keep us from overestimating the available
cluster resources. But this would require some changes to the slot allocation
protocol because we would also need to notify the JM of the allocation failure
so that it can handle it.
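
To make the trade-off between the two timeout assumptions concrete, here is a minimal sketch. It is not Flink code: the class, the enum values and the slot names are hypothetical, and the scenario assumes that the TM did allocate the slot but its acknowledgement was lost.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the RM's bookkeeping; none of these names exist in Flink.
class TimeoutPolicySketch {
    enum SlotState { FREE, ALLOCATED }
    enum OnTimeout { ASSUME_FAILURE, ASSUME_SUCCESS }

    /** Number of slots the RM believes are free after a slot request timed out. */
    static long freeSlotsSeenByRm(OnTimeout policy) {
        Map<String, SlotState> rmView = new HashMap<>();
        rmView.put("slot-0", SlotState.FREE);
        rmView.put("slot-1", SlotState.FREE);

        // The RM requests slot-0. Assumed scenario: the TM allocates the slot,
        // but the acknowledgement is lost, so the RM only observes a timeout.
        SlotState afterTimeout = (policy == OnTimeout.ASSUME_SUCCESS)
                ? SlotState.ALLOCATED   // pessimistic: treat the slot as taken
                : SlotState.FREE;       // current behaviour: treat the request as failed
        rmView.put("slot-0", afterTimeout);

        return rmView.values().stream()
                .filter(state -> state == SlotState.FREE)
                .count();
    }

    public static void main(String[] args) {
        // Only slot-1 is really free on the TM in this scenario.
        System.out.println("assume failure: " + freeSlotsSeenByRm(OnTimeout.ASSUME_FAILURE));
        System.out.println("assume success: " + freeSlotsSeenByRm(OnTimeout.ASSUME_SUCCESS));
    }
}
```

In this scenario the assume-failure policy reports two free slots although only one is free, while the assume-success policy reports one. The sketch leaves out the part that would need the protocol change: telling the JM when a request that was assumed to have succeeded actually failed.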
For now I would suggest that we go for the simple solution of making the
{{HeartbeatManagerImpls}} single-threaded. This issue should then be fixed
by the same fix as FLINK-12863. I'm currently working on it and hope to
open a PR today or tomorrow.
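
As a rough illustration of why reading and sending the slot report in one thread removes the first race from the issue description, the following sketch replays both schedules deterministically. It is not the actual Flink implementation: all names are hypothetical, and it assumes that messages from the TM reach the RM in the order in which they were sent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Flink code: one slot on the TM and an ordered channel to the RM.
class HeartbeatOrderingSketch {
    enum SlotState { FREE, ALLOCATED }

    /** The RM applies every message in arrival order; the last one wins. */
    static SlotState applyInOrder(List<SlotState> channel) {
        SlotState rmView = SlotState.FREE;
        for (SlotState message : channel) {
            rmView = message;
        }
        return rmView;
    }

    /** The report is read in the main thread but sent later from another thread. */
    static SlotState racySchedule() {
        List<SlotState> channel = new ArrayList<>();
        SlotState tmSlot = SlotState.FREE;
        SlotState report = tmSlot;       // main thread reads the slot report (FREE)
        tmSlot = SlotState.ALLOCATED;    // main thread handles the RM's slot request
        channel.add(tmSlot);             // the acknowledgement is sent first
        channel.add(report);             // the stale report is sent afterwards
        return applyInOrder(channel);
    }

    /** Reading and sending the report is a single step in the main thread. */
    static SlotState singleThreadedSchedule() {
        List<SlotState> channel = new ArrayList<>();
        SlotState tmSlot = SlotState.FREE;
        channel.add(tmSlot);             // report is read and sent before the request is handled
        tmSlot = SlotState.ALLOCATED;    // main thread handles the RM's slot request
        channel.add(tmSlot);             // the acknowledgement is sent last
        return applyInOrder(channel);
    }

    public static void main(String[] args) {
        System.out.println("racy: " + racySchedule());
        System.out.println("single-threaded: " + singleThreadedSchedule());
    }
}
```

In the racy schedule the RM ends up with FREE for a slot that is allocated on the TM. In the single-threaded schedule the report can no longer be overtaken by the acknowledgement, so the RM ends up with ALLOCATED.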
> State inconsistency between RM and TM on the slot status
> --------------------------------------------------------
>
> Key: FLINK-12865
> URL: https://issues.apache.org/jira/browse/FLINK-12865
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Yun Gao
> Assignee: Yun Gao
> Priority: Major
>
> There may be state inconsistency between the TM and the RM due to a race
> condition and message loss:
> # When the TM sends a heartbeat, it retrieves the SlotReport in the main
> thread but sends the heartbeat in another thread. The slot on the TM may
> initially be FREE when the SlotReport is read; the RM then requests the slot
> and marks it as allocated, and the stale SlotReport finally overrides the
> allocated status on the RM side.
> # When the RM requests a slot, the TM may receive the request but the
> acknowledgement message gets lost. The RM will then think that this slot is free.
> Both problems may cause the RM to mark an ALLOCATED slot as FREE. Currently
> this may cause additional retries until the state is synchronized by the
> next heartbeat, and in the future it may lead to inaccurate resource
> statistics for fine-grained resource management.