[ 
https://issues.apache.org/jira/browse/FLINK-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865447#comment-16865447
 ] 

sunjincheng edited comment on FLINK-12865 at 6/17/19 8:58 AM:
--------------------------------------------------------------

Hi [~gaoyunhaii], thanks for reporting this issue and helping to fix it! :)

Is there any abnormal information? If I understand correctly, this should not 
happen frequently, right?

The reason I ask is that I want to evaluate whether this should be a blocker 
for the 1.8.1 release. If so, we need to fix it as soon as possible and mark it as 
Critical.


> State inconsistency between RM and TM on the slot status
> --------------------------------------------------------
>
>                 Key: FLINK-12865
>                 URL: https://issues.apache.org/jira/browse/FLINK-12865
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Major
>
> There may be state inconsistency between the TM and the RM due to a race 
> condition and message loss:
>  # When the TM sends a heartbeat, it retrieves the SlotReport in the main 
> thread but sends the heartbeat in another thread. The slot on the TM may 
> initially be FREE, so the SlotReport reads the FREE state; the RM then 
> requests the slot and marks it as allocated, and finally the stale SlotReport 
> wrongly overrides the allocated status on the RM side.
>  # When the RM requests a slot, the TM receives the request but the 
> acknowledgement message gets lost. The RM will then think the slot is free.
>  Both problems may cause the RM to mark an ALLOCATED slot as FREE. Currently 
> this may cause additional retries until the state is synchronized after the 
> next heartbeat, and in the future it may lead to inaccurate resource 
> statistics for fine-grained resource management.
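
To make the two scenarios in the description concrete, here is a minimal, single-threaded sketch that replays the problematic orderings deterministically. It is not Flink code: the class and function names (TaskManager, ResourceManager, snapshot_slot_report, request_slot, on_heartbeat) are hypothetical stand-ins chosen for illustration, and the RM is assumed to overwrite its view with whatever report arrives.

```python
from dataclasses import dataclass, field
from enum import Enum


class SlotState(Enum):
    FREE = "FREE"
    ALLOCATED = "ALLOCATED"


@dataclass
class TaskManager:
    # Hypothetical model of the TM side: slot id -> actual slot state.
    slots: dict = field(default_factory=dict)

    def snapshot_slot_report(self):
        # Copy of the slot states at this instant (the "main thread" read).
        return dict(self.slots)

    def allocate(self, slot_id):
        self.slots[slot_id] = SlotState.ALLOCATED


@dataclass
class ResourceManager:
    # Hypothetical model of the RM side: slot id -> state the RM believes.
    view: dict = field(default_factory=dict)

    def request_slot(self, tm, slot_id, ack_lost=False):
        tm.allocate(slot_id)  # the TM always processes the request
        if not ack_lost:
            # Only an acknowledgement that arrives updates the RM's view.
            self.view[slot_id] = SlotState.ALLOCATED

    def on_heartbeat(self, report):
        # Assumed naive handling: the report overwrites the RM's view.
        self.view.update(report)


def simulate_stale_report():
    """Scenario 1: the report is read before, but delivered after, allocation."""
    tm = TaskManager({"slot-0": SlotState.FREE})
    rm = ResourceManager({"slot-0": SlotState.FREE})
    report = tm.snapshot_slot_report()  # 1. report captures FREE
    rm.request_slot(tm, "slot-0")       # 2. RM allocates the slot
    rm.on_heartbeat(report)             # 3. stale report reaches the RM
    return tm.slots["slot-0"], rm.view["slot-0"]


def simulate_lost_ack():
    """Scenario 2: the TM allocates, but its acknowledgement never arrives."""
    tm = TaskManager({"slot-0": SlotState.FREE})
    rm = ResourceManager({"slot-0": SlotState.FREE})
    rm.request_slot(tm, "slot-0", ack_lost=True)
    return tm.slots["slot-0"], rm.view["slot-0"]


if __name__ == "__main__":
    for name, scenario in [("stale report", simulate_stale_report),
                           ("lost ack", simulate_lost_ack)]:
        tm_state, rm_state = scenario()
        print(f"{name}: TM={tm_state.value}, RM view={rm_state.value}")
```

In both runs the TM holds the slot as ALLOCATED while the RM's view says FREE, which is the inconsistency reported here. In this simplified model, a later heartbeat taken after the allocation would bring the RM's view back in line, matching the "retries until the next heartbeat" behaviour described above.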



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
