[
https://issues.apache.org/jira/browse/FLINK-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Pohl resolved FLINK-26987.
-----------------------------------
Resolution: Fixed
master: cda343349f5d2b080218b7fe1993794a5a16c272
1.15: aa3bb951db745f94070f2ef6ecb62ce207bda520
> ZooKeeperStateHandleStore.getAllAndLock ends up in an infinite loop if there's
> an entry marked for deletion that's not cleaned up yet
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-26987
> URL: https://issues.apache.org/jira/browse/FLINK-26987
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0, 1.16.0
> Reporter: Matthias Pohl
> Assignee: Matthias Pohl
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.15.0
>
>
> {{ZooKeeperStateHandleStore.getAllAndLock}} is used when recovering
> {{CompletedCheckpoints}}. It iterates over all children and retries until it
> reaches a stable and consistent version (i.e. no entries are marked for
> deletion and no child nodes were added while accessing the ZK instance).
> Additionally, {{ZooKeeperStateHandleStore}} marks entries for deletion
> internally before actually deleting them. This can lead to a state where an
> entry is marked for deletion but the discard fails, causing the cleanup to
> fail. The entry is left marked for deletion and another cleanup is
> attempted. These retries can continue indefinitely, but users can limit the
> number of retries. In that case, the entry might be left marked.
> Restarting the Flink cluster will then access this
> ZooKeeperStateHandleStore to recover the checkpoints while this entry is still
> marked for deletion, which causes an error in
> [ZooKeeperStateHandleStore.getAllAndLock|https://github.com/apache/flink/blob/c3df4c3f1f868d40e1e70404bea41b7a007e8b08/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L413]
> and results in an undesired retry loop.
> We actually don't need to retry in that case because the child can be
> ignored, as far as I can see.
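
The behaviour described above can be illustrated with a small in-memory model. This is only a sketch and not Flink's actual implementation: the classes Entry and Store, the method getAll, the ignoreMarked flag and the maxAttempts bound are all hypothetical names introduced for illustration. The attempt bound exists only so the sketch terminates; the loop described in the issue has no such limit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified in-memory model of the retry loop; NOT the real ZooKeeperStateHandleStore. */
public class GetAllAndLockSketch {

    /** Hypothetical stand-in for a child node that may be marked for deletion. */
    static final class Entry {
        final String payload;
        final boolean markedForDeletion;

        Entry(String payload, boolean markedForDeletion) {
            this.payload = payload;
            this.markedForDeletion = markedForDeletion;
        }
    }

    /** Hypothetical stand-in for the parent node: children plus a version bumped on every add. */
    static final class Store {
        final Map<String, Entry> children = new LinkedHashMap<>();
        int childVersion = 0;

        void put(String name, Entry entry) {
            children.put(name, entry);
            childVersion++;
        }
    }

    /**
     * Reads all entries, retrying until a pass is consistent.
     * ignoreMarked = false models the reported behaviour: a marked entry fails the pass.
     * ignoreMarked = true models the proposed behaviour: a marked entry is skipped.
     * maxAttempts is an assumption added so that the sketch cannot loop forever.
     */
    static List<String> getAll(Store store, boolean ignoreMarked, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int versionBefore = store.childVersion;
            List<String> result = new ArrayList<>();
            boolean consistent = true;

            for (Entry entry : store.children.values()) {
                if (entry.markedForDeletion) {
                    if (ignoreMarked) {
                        continue; // skip the leftover entry instead of retrying
                    }
                    consistent = false; // the same entry fails every later pass too
                    break;
                }
                result.add(entry.payload);
            }

            // Accept the pass only if no child was added while reading.
            if (consistent && versionBefore == store.childVersion) {
                return result;
            }
        }
        throw new IllegalStateException("No stable snapshot after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.put("chk-1", new Entry("chk-1", false));
        store.put("chk-2", new Entry("chk-2", true)); // left over from a failed cleanup
        store.put("chk-3", new Entry("chk-3", false));

        System.out.println(getAll(store, true, 3));
    }
}
```

With ignoreMarked set to false, the same store never yields a consistent pass, because nothing removes the marked entry between attempts. In this model that is the retry loop the issue describes.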
--
This message was sent by Atlassian Jira
(v8.20.1#820001)