[ 
https://issues.apache.org/jira/browse/FLINK-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866199#comment-16866199
 ] 

Yun Gao edited comment on FLINK-12865 at 6/18/19 3:52 AM:
----------------------------------------------------------

[~till.rohrmann] Hi Till, thanks a lot for the explanation, and sorry for not 
describing the second problem accurately. Although TCP handles retransmission, 
we may hit an AkkaAskTimeoutException before the retransmission succeeds. While 
handling that exception, the SlotManager marks the slot as FREE. However, since 
the TM has already received the request, it marks the slot as ALLOCATED, which 
causes the inconsistency. I agree, though, that this only happens in rare 
scenarios.

For both problems, we end up with the RM marking a slot as FREE while the TM 
marks it as ALLOCATED. Currently this does not cause a serious problem: the next 
heartbeat syncs the correct status to the RM. In the meantime the RM may assign 
the slot to another request, but that requestSlot call to the TM fails with 
SlotOccupiedException because the slot is already allocated to the previous 
request, and the RM handles this exception correctly. Since the only cost is a 
few additional retries and this happens infrequently, I think this issue need 
not be a blocker for 1.8.1 [~sunjincheng121]. I also agree that sending 
heartbeats in the main thread is enough, since it addresses most cases of 
FLINK-12865 and FLINK-12863 uniformly.
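
To make the sequence above concrete, here is a minimal sketch in Java. It is a 
toy model under stated assumptions, not Flink's code: the classes 
TaskManagerSlot and ResourceManagerView, the methods onAskTimeout and 
onHeartbeat, and the local SlotOccupiedException are all hypothetical stand-ins 
that only mirror the behavior described in this comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the RM/TM slot-state divergence; none of these are Flink classes. */
class SlotStateSketch {

    enum SlotState { FREE, ALLOCATED }

    /** Hypothetical stand-in for the exception the TM raises for an occupied slot. */
    static class SlotOccupiedException extends RuntimeException {
        SlotOccupiedException(String message) { super(message); }
    }

    /** TM side: holds the actual state of the slot. */
    static class TaskManagerSlot {
        SlotState state = SlotState.FREE;
        String owner;

        void requestSlot(String allocationId) {
            if (state == SlotState.ALLOCATED && !allocationId.equals(owner)) {
                throw new SlotOccupiedException("slot held by " + owner);
            }
            state = SlotState.ALLOCATED;
            owner = allocationId;
        }
    }

    /** RM side: a view of the same slot that can become stale. */
    static class ResourceManagerView {
        SlotState state = SlotState.FREE;

        // The ask timed out before the ack arrived, so the RM falls back to FREE.
        void onAskTimeout() { state = SlotState.FREE; }

        // The heartbeat carries the TM's reported state, which replaces the RM view.
        void onHeartbeat(SlotState reported) { state = reported; }
    }

    static List<String> simulate() {
        List<String> log = new ArrayList<>();
        TaskManagerSlot tm = new TaskManagerSlot();
        ResourceManagerView rm = new ResourceManagerView();

        // 1. Request A reaches the TM, but the RM times out waiting for the ack.
        tm.requestSlot("A");
        rm.onAskTimeout();
        log.add("after timeout: RM=" + rm.state + " TM=" + tm.state);

        // 2. The RM believes the slot is free and offers it to request B.
        try {
            tm.requestSlot("B");
            log.add("B accepted");
        } catch (SlotOccupiedException e) {
            log.add("B rejected: " + e.getMessage());
        }

        // 3. The next heartbeat repairs the RM view.
        rm.onHeartbeat(tm.state);
        log.add("after heartbeat: RM=" + rm.state + " TM=" + tm.state);
        return log;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

In this model the only cost of the stale FREE state is the rejected request B, 
which corresponds to the additional retry mentioned above.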

However, if in the future other functions come to rely on the slot status, I 
think this issue may require more thought. For example, if we use a bookkeeping 
method to track the remaining resources on each TM under fine-grained resource 
management and update them according to the slots' status, the inconsistency 
could temporarily overestimate the remaining resources of some TMs and schedule 
tasks onto a TM without enough resources. Since that is not the case for now, I 
agree that we do not need to take these cases into consideration right now. 
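
As a rough illustration of the bookkeeping concern, the following Java sketch 
computes the remaining resources of one TM from a view of its slots. Everything 
here is an assumption made for the example: the class and method names are 
hypothetical, and the numbers (4 cores per TM, 2 cores per slot, a task needing 
3 cores) are invented and do not come from Flink.

```java
/** Toy bookkeeping model; names and numbers are illustrative only. */
class ResourceBookkeepingSketch {

    /** Remaining cores on a TM, derived from which slots are believed allocated. */
    static int remainingCores(int totalCores, int coresPerSlot, boolean[] slotAllocated) {
        int used = 0;
        for (boolean allocated : slotAllocated) {
            if (allocated) {
                used += coresPerSlot;
            }
        }
        return totalCores - used;
    }

    public static void main(String[] args) {
        int totalCores = 4;
        int coresPerSlot = 2;
        int taskCores = 3;

        boolean[] tmTruth = {true, false};  // slot 0 is really allocated on the TM
        boolean[] rmView = {false, false};  // stale RM view: slot 0 still marked FREE

        int actual = remainingCores(totalCores, coresPerSlot, tmTruth);
        int believed = remainingCores(totalCores, coresPerSlot, rmView);

        System.out.println("RM believes remaining = " + believed + ", actual = " + actual);
        System.out.println("RM would schedule the task: " + (believed >= taskCores));
        System.out.println("TM can actually fit it:     " + (actual >= taskCores));
    }
}
```

Until the next heartbeat corrects the RM view, a scheduler relying on the 
believed value would place the task on a TM that cannot hold it.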


> State inconsistency between RM and TM on the slot status
> --------------------------------------------------------
>
>                 Key: FLINK-12865
>                 URL: https://issues.apache.org/jira/browse/FLINK-12865
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Major
>
> There may be state inconsistency between the TM and the RM due to a race 
> condition and message loss:
>  # When the TM sends a heartbeat, it retrieves the SlotReport in the main 
> thread but sends the heartbeat in another thread. The slot on the TM may be 
> FREE initially, so the SlotReport reads the FREE state; then the RM requests 
> the slot and marks it as allocated, and finally the stale SlotReport wrongly 
> overrides the allocated status on the RM side.
>  # When the RM requests a slot, the TM receives the request but the 
> acknowledgement message gets lost. The RM will then think this slot is free. 
>  Both problems may cause the RM to mark an ALLOCATED slot as FREE. Currently 
> this may cause additional retries until the state is synchronized by the next 
> heartbeat, and in the future it may cause inaccurate resource statistics for 
> fine-grained resource management.
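
The first race in the description above can be sketched as a fixed interleaving 
in Java. This is a deliberately simplified model, not Flink's heartbeat code: 
the class name, the rmViewAfter method, and its sendFromMainThread flag are 
hypothetical, and real message ordering between the RM and the TM is reduced to 
the order of a few assignments.

```java
/** Toy model of the stale-SlotReport race; not Flink's heartbeat implementation. */
class HeartbeatRaceSketch {

    enum SlotState { FREE, ALLOCATED }

    /** Returns the RM's final view of the slot for one fixed interleaving. */
    static SlotState rmViewAfter(boolean sendFromMainThread) {
        SlotState tmSlot = SlotState.FREE;
        SlotState rmView = SlotState.FREE;

        // The TM main thread builds the report while the slot is still FREE.
        SlotState report = tmSlot;

        if (sendFromMainThread) {
            // The report goes out before any later slot request is handled.
            rmView = report;
        }

        // The RM requests the slot; both sides now record ALLOCATED.
        tmSlot = SlotState.ALLOCATED;
        rmView = SlotState.ALLOCATED;

        if (!sendFromMainThread) {
            // Another thread sends the stale report only now, overwriting the RM view.
            rmView = report;
        }
        return rmView;
    }

    public static void main(String[] args) {
        System.out.println("send from other thread: RM=" + rmViewAfter(false));
        System.out.println("send from main thread:  RM=" + rmViewAfter(true));
    }
}
```

In this model, sending the report from the same thread that built it keeps the 
RM view at ALLOCATED, while the delayed send leaves the RM with a stale FREE.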



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
