[ 
https://issues.apache.org/jira/browse/FLINK-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866148#comment-16866148
 ] 

Xiaogang Shi commented on FLINK-12865:
--------------------------------------

[~till.rohrmann] You are right that there is no problem with the postponed 
handling of slot requests. I revisited the code and found that we do use ask to 
send heartbeat requests, but the responses are not sent back to 
{{PromiseActorRef}}. Instead, they are sent back directly to the RM with a 
separate RPC method. So the handling of the heartbeat responses will not be 
postponed. 

After revisiting the code, it seems sending heartbeats in the main thread will 
fix the problem.
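
To illustrate the first race from the issue description, here is a minimal 
sketch. It is not Flink code: {{SlotReportRaceSketch}}, {{ResourceManagerView}}, 
{{TaskManagerSlots}} and their methods are hypothetical names, and the 
interleaving is replayed sequentially instead of with real threads. It assumes 
that the RM simply overwrites its view with whatever report arrives.

```java
import java.util.HashMap;
import java.util.Map;

public class SlotReportRaceSketch {

    enum SlotState { FREE, ALLOCATED }

    /** Hypothetical stand-in for the RM's view of the slots of one TM. */
    static final class ResourceManagerView {
        private final Map<Integer, SlotState> slots = new HashMap<>();

        void markAllocated(int slotId) {
            slots.put(slotId, SlotState.ALLOCATED);
        }

        /** Assumption: a report overwrites the RM's view unconditionally. */
        void applySlotReport(Map<Integer, SlotState> report) {
            slots.putAll(report);
        }

        SlotState stateOf(int slotId) {
            return slots.get(slotId);
        }
    }

    /** Hypothetical stand-in for the slot table on the TM. */
    static final class TaskManagerSlots {
        private final Map<Integer, SlotState> slots = new HashMap<>();

        TaskManagerSlots(int numberOfSlots) {
            for (int i = 0; i < numberOfSlots; i++) {
                slots.put(i, SlotState.FREE);
            }
        }

        void allocate(int slotId) {
            slots.put(slotId, SlotState.ALLOCATED);
        }

        /** Returns a snapshot (copy) of the current slot states. */
        Map<Integer, SlotState> createSlotReport() {
            return new HashMap<>(slots);
        }
    }

    /** Snapshot taken before the allocation, delivered after it. */
    static SlotState staleReportScenario() {
        TaskManagerSlots tm = new TaskManagerSlots(1);
        ResourceManagerView rm = new ResourceManagerView();
        rm.applySlotReport(tm.createSlotReport());

        // Main thread reads the report while slot 0 is still FREE.
        Map<Integer, SlotState> report = tm.createSlotReport();
        // The slot request is handled before the report is sent.
        tm.allocate(0);
        rm.markAllocated(0);
        // The other thread now sends the stale report.
        rm.applySlotReport(report);
        return rm.stateOf(0);
    }

    /** Snapshot and send happen as one step in the main thread. */
    static SlotState sameThreadScenario() {
        TaskManagerSlots tm = new TaskManagerSlots(1);
        ResourceManagerView rm = new ResourceManagerView();
        rm.applySlotReport(tm.createSlotReport());

        tm.allocate(0);
        rm.markAllocated(0);
        // No slot request can be handled between the read and the send.
        rm.applySlotReport(tm.createSlotReport());
        return rm.stateOf(0);
    }

    public static void main(String[] args) {
        System.out.println("stale report: " + staleReportScenario());
        System.out.println("same thread:  " + sameThreadScenario());
    }
}
```

In this model the stale report leaves the RM believing the slot is FREE while 
the TM holds it as ALLOCATED. Creating and sending the report in one step keeps 
both sides consistent.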

Thanks a lot for your explanation and sorry for my misleading information.

> State inconsistency between RM and TM on the slot status
> --------------------------------------------------------
>
>                 Key: FLINK-12865
>                 URL: https://issues.apache.org/jira/browse/FLINK-12865
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Major
>
> There may be state inconsistency between the TM and the RM due to a race 
> condition and message loss:
>  # When the TM sends a heartbeat, it retrieves the SlotReport in the main 
> thread but sends the heartbeat in another thread. The slot on the TM may be 
> FREE initially, so the SlotReport records the FREE state; then the RM requests 
> the slot and marks it as allocated, and finally the stale SlotReport wrongly 
> overrides the allocated status on the RM side.
>  # When the RM requests a slot, the TM receives the request but the 
> acknowledgement message gets lost. The RM will then think this slot is free. 
>  Both problems may cause the RM to mark an ALLOCATED slot as FREE. Currently 
> this may cause additional retries until the state is synchronized by the next 
> heartbeat, and in the future it may lead to inaccurate resource statistics 
> for fine-grained resource management.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
