[
https://issues.apache.org/jira/browse/FLINK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhijiang updated FLINK-17994:
-----------------------------
Fix Version/s: 1.12.0
> Fix the race condition between CheckpointBarrierUnaligner#processBarrier and
> #notifyBarrierReceived
> ---------------------------------------------------------------------------------------------------
>
> Key: FLINK-17994
> URL: https://issues.apache.org/jira/browse/FLINK-17994
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Reporter: Zhijiang
> Assignee: Zhijiang
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.11.0, 1.12.0
>
>
> The race condition issue happens as follow:
> * ch1 is received from network for one input channel by netty thread and
> schedule the ch1 into mailbox via #notifyBarrierReceived
> * ch2 is received from network for another input channel by netty thread,
> but before calling #notifyBarrierReceived this barrier was inserted into
> channel's data queue in advance. Then it would cause task thread process ch2
> earlier than #notifyBarrierReceived by netty thread.
> * Task thread would execute checkpoint for ch2 immediately because ch2 > ch1.
> * After that, the previous scheduled ch1 is performed from mailbox by task
> thread, then it causes the IllegalArgumentException inside
> SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the
> initial assumption that checkpoint is executed in incremental way for aligned
> mode.
> The key problem is that we can not remove the checkpoint action from mailbox
> queue before the next checkpoint is going to execute now. One possible
> solution is that we record the previous aborted checkpoint id inside
> SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds, then when the queued
> checkpoint inside mailbox is executing, it will exit directly if found the
> checkpoint id was already aborted before.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)