[ 
https://issues.apache.org/jira/browse/FLINK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-17994:
-----------------------------
    Affects Version/s: 1.11.0

> Fix the race condition between CheckpointBarrierUnaligner#processBarrier and 
> #notifyBarrierReceived
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-17994
>                 URL: https://issues.apache.org/jira/browse/FLINK-17994
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Zhijiang
>            Assignee: Zhijiang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0, 1.12.0
>
>
> The race condition occurs as follows:
>  * ch1 is received from the network on one input channel by the netty 
> thread, which schedules ch1 into the mailbox via #notifyBarrierReceived.
>  * ch2 is received from the network on another input channel by the netty 
> thread, but the barrier is inserted into the channel's data queue before 
> #notifyBarrierReceived is called. The task thread can therefore process ch2 
> before the netty thread calls #notifyBarrierReceived.
>  * The task thread executes the checkpoint for ch2 immediately because 
> ch2 > ch1.
>  * Afterwards, the task thread runs the previously scheduled ch1 from the 
> mailbox, which causes an IllegalArgumentException inside 
> SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the 
> initial assumption that checkpoints are executed in increasing order for 
> aligned mode.
> The key problem is that we cannot remove the queued checkpoint action from 
> the mailbox before the next checkpoint executes. One possible solution is to 
> record the id of the previous, aborted checkpoint in 
> SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds. When the queued 
> checkpoint action is later executed from the mailbox, it exits immediately 
> if it finds that its checkpoint id has already been aborted.
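
The proposed solution can be illustrated with a minimal, single-threaded sketch. It is not Flink's implementation: the class AbortedCheckpointSketch, its methods, and the rule "a queued id smaller than the id being processed counts as aborted" are assumptions made only for illustration. Only the name abortedCheckpointIds and the IllegalArgumentException on an out-of-order checkpoint are taken from the description above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of the task thread; these are NOT Flink's real classes.
final class AbortedCheckpointSketch {
    // Stand-in for the mailbox: checkpoint actions queued by the netty thread.
    private final Queue<Long> mailbox = new ArrayDeque<>();
    // Stand-in for SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds.
    private final Set<Long> abortedCheckpointIds = new HashSet<>();
    // Ids of the checkpoints that were actually executed, in order.
    private final List<Long> executed = new ArrayList<>();
    private long lastCheckpointId = -1L;

    // Netty thread path: schedule the checkpoint action into the mailbox.
    void notifyBarrierReceived(long checkpointId) {
        mailbox.add(checkpointId);
    }

    // Task thread path: a barrier is read directly from a channel's data queue.
    void processBarrier(long checkpointId) {
        if (checkpointId <= lastCheckpointId) {
            return; // already handled
        }
        // Assumption: queued actions cannot be removed from the mailbox,
        // so older queued ids are only recorded as aborted.
        for (long queuedId : mailbox) {
            if (queuedId < checkpointId) {
                abortedCheckpointIds.add(queuedId);
            }
        }
        checkpointState(checkpointId);
    }

    // Task thread drains the mailbox; aborted actions exit directly.
    void runMailbox() {
        Long checkpointId;
        while ((checkpointId = mailbox.poll()) != null) {
            if (abortedCheckpointIds.remove(checkpointId)) {
                continue; // superseded by a newer checkpoint: skip it
            }
            checkpointState(checkpointId);
        }
    }

    // Mirrors the assumption that checkpoint ids only ever increase.
    private void checkpointState(long checkpointId) {
        if (checkpointId <= lastCheckpointId) {
            throw new IllegalArgumentException(
                    "Out-of-order checkpoint " + checkpointId
                            + ", last executed was " + lastCheckpointId);
        }
        lastCheckpointId = checkpointId;
        executed.add(checkpointId);
    }

    List<Long> executed() {
        return executed;
    }
}
```

Replaying the scenario in this model means calling notifyBarrierReceived(1L), then processBarrier(2L), then runMailbox(). Only checkpoint 2 is executed, and the queued action for checkpoint 1 is skipped. Without the abortedCheckpointIds check in runMailbox, the same sequence would reach checkpointState(1L) after checkpoint 2 and throw the IllegalArgumentException.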



--
This message was sent by Atlassian Jira
(v8.3.4#803005)