[
https://issues.apache.org/jira/browse/FLINK-38347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18019499#comment-18019499
]
Zakelly Lan commented on FLINK-38347:
-------------------------------------
Problematic steps:
# The job starts.
# The first checkpoint completes, but the completion notification is lost.
# The second checkpoint is triggered. One subtask does not receive the trigger,
but does receive the abort notification.
# Job failover (FO): the directory was deleted by reference counting (the start
of checkpoint 1 was paired with the abort of checkpoint 2).
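
The steps above can be replayed with a small model. This is only a sketch under
stated assumptions: the class and method names (OrphanDirectorySketch,
RefCountTracker, IdAwareTracker, onStart, onComplete, onAbort) are hypothetical
and are not Flink's actual classes or its actual fix. It assumes a directory is
treated as orphaned once every counted start has been matched by an abort and
no completion has been seen.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of one subtask's view of a directory; not Flink's real classes. */
final class OrphanDirectorySketch {

    /** Pairs any start with any abort, so it cannot notice a lost message. */
    static final class RefCountTracker {
        private int references = 0;
        private boolean retained = false; // set once a completion notification is seen
        private boolean deleted = false;

        void onStart(long checkpointId) {
            references++;
        }

        void onComplete(long checkpointId) {
            retained = true;
        }

        void onAbort(long checkpointId) {
            // The checkpoint ID is ignored: this is the weakness being illustrated.
            references--;
            if (references <= 0 && !retained) {
                deleted = true; // directory is treated as orphaned
            }
        }

        boolean isDeleted() {
            return deleted;
        }
    }

    /** Remembers which checkpoint IDs actually started and ignores unmatched aborts. */
    static final class IdAwareTracker {
        private final Set<Long> started = new HashSet<>();
        private boolean retained = false;
        private boolean deleted = false;

        void onStart(long checkpointId) {
            started.add(checkpointId);
        }

        void onComplete(long checkpointId) {
            started.remove(checkpointId);
            retained = true;
        }

        void onAbort(long checkpointId) {
            // Only an abort for a checkpoint this subtask saw start may release the directory.
            boolean known = started.remove(checkpointId);
            if (known && started.isEmpty() && !retained) {
                deleted = true;
            }
        }

        boolean isDeleted() {
            return deleted;
        }
    }
}
```

Replaying the scenario means calling onStart(1), skipping onComplete(1) (lost),
skipping onStart(2) (not received), and then calling onAbort(2). In this model
the counting tracker marks the directory as deleted, while the ID-aware tracker
ignores the unmatched abort and keeps it.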
> Checkpoint file-merging manager may delete the directory unexpectedly when
> some RPC messages lost
> -------------------------------------------------------------------------------------------------
>
> Key: FLINK-38347
> URL: https://issues.apache.org/jira/browse/FLINK-38347
> Project: Flink
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.20.2, 2.1.0
> Reporter: Zakelly Lan
> Assignee: Zakelly Lan
> Priority: Major
>
> In FLINK-32086, we delete the orphan directories created by the file-merging
> manager. The orphan check depends on checkpoint notifications, so it should
> tolerate lost RPC messages. However, the current implementation uses
> reference counting, which does not verify message completeness using the
> checkpoint ID. This may cause unexpected directory deletion, although it is
> rare.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)