[ 
https://issues.apache.org/jira/browse/FLINK-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133282#comment-17133282
 ] 

Roman Khachatryan edited comment on FLINK-18257 at 6/11/20, 2:25 PM:
---------------------------------------------------------------------

There also appears to be a race condition when checking discarded, because it 
is read and written by different threads (timer and ioExecutor).

 

Edit:

I missed the previous comment. Because of the race, I think we still have an 
issue:
 # ioExecutor acquires checkpoint.lock
 # timer thread acquires CheckpointCoordinator.lock and reads discarded == false
 # ioExecutor writes discarded = true
 # timer doesn't complete the future

Apart from discarded, notYetAcknowledgedMasterStates is also not thread safe.
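
The interleaving above relies on the flag being read under one lock and 
written under another. A minimal sketch of one way to close that window, 
assuming a single lock that guards both the flag and the completion of the 
future. PendingCheckpointSketch and all of its members are hypothetical names 
for illustration, not Flink's actual classes:

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for a pending checkpoint; not Flink's actual class. */
public class PendingCheckpointSketch {
    // Assumption: one lock guards the flag, the counter, and future completion.
    private final Object lock = new Object();
    private boolean discarded;
    private int pendingMasterStates;
    private final CompletableFuture<Void> masterStateFuture = new CompletableFuture<>();

    public PendingCheckpointSketch(int pendingMasterStates) {
        this.pendingMasterStates = pendingMasterStates;
        if (pendingMasterStates == 0) {
            masterStateFuture.complete(null);
        }
    }

    /** Would be called from the timer thread in the scenario above. */
    public void acknowledgeMasterState() {
        synchronized (lock) {
            if (discarded) {
                return; // dispose() has already failed the future under the same lock
            }
            if (--pendingMasterStates == 0) {
                masterStateFuture.complete(null);
            }
        }
    }

    /** Would be called from the ioExecutor thread when the checkpoint is aborted. */
    public void dispose(Throwable cause) {
        synchronized (lock) {
            discarded = true;
            // No-op if the future has already completed normally.
            masterStateFuture.completeExceptionally(cause);
        }
    }

    public CompletableFuture<Void> getMasterStateFuture() {
        return masterStateFuture;
    }
}
```

Because the write to discarded and the failure of the future happen in the 
same critical section as the read, either the acknowledging thread or the 
disposing thread always completes the future.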


was (Author: roman_khachatryan):
Looks like there is also a race condition when checking discarded because it's 
being read and written by different threads (timer & ioExecutor).

> MasterStateSnapshot future may not be completed
> -----------------------------------------------
>
>                 Key: FLINK-18257
>                 URL: https://issues.apache.org/jira/browse/FLINK-18257
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Roman Khachatryan
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>             Fix For: 1.11.0
>
>
> From 
> https://issues.apache.org/jira/browse/FLINK-18137?focusedCommentId=17133144&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17133144:
> There are several cases when masterStateCompletableFuture can be left 
> incomplete:
>  # checkpoint is discarded (aborted) - line 690 throws an exception instead 
> of completing the future
>  # checkpoint is discarded (aborted) - line 696 doesn't complete the future 
> even if everything is acked (this is what [~trohrmann] found)
>  # CompletableFuture.allOf waits for both masterStates and 
> coordinatorCheckpoints futures while it could "return" as soon as one fails
> Need to check first.
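
To illustrate the third case: CompletableFuture.allOf completes only once 
every input has completed, even if one of them has already failed. A minimal 
sketch of a fail-fast variant follows. FailFastAllOf and allOfFailFast are 
hypothetical names, and this is not necessarily how the issue was fixed in 
Flink:

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical helper for illustration; not part of Flink or the JDK. */
public class FailFastAllOf {

    /** Like CompletableFuture.allOf, but fails as soon as any input fails. */
    public static CompletableFuture<Void> allOfFailFast(CompletableFuture<?>... futures) {
        CompletableFuture<Void> result = new CompletableFuture<>();

        // Success path: complete once every input has completed normally.
        CompletableFuture.allOf(futures).whenComplete((ignored, error) -> {
            if (error == null) {
                result.complete(null);
            } else {
                result.completeExceptionally(error);
            }
        });

        // Failure path: propagate the first failure without waiting for the rest.
        for (CompletableFuture<?> future : futures) {
            future.whenComplete((ignored, error) -> {
                if (error != null) {
                    result.completeExceptionally(error);
                }
            });
        }
        return result;
    }
}
```

With two pending inputs, failing one of them leaves the plain allOf future 
incomplete until the other input finishes, while the future returned by 
allOfFailFast is already completed exceptionally.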



