[jira] [Resolved] (FLINK-26987) ZooKeeperStateHandleStore.getAllAndLock ends up in a infinite loop if there's an entry marked for deletion that's not cleaned up, yet

Matthias Pohl (Jira) Sun, 03 Apr 2022 22:39:05 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matthias Pohl resolved FLINK-26987.
-----------------------------------
    Resolution: Fixed

master: cda343349f5d2b080218b7fe1993794a5a16c272
1.15: aa3bb951db745f94070f2ef6ecb62ce207bda520

> ZooKeeperStateHandleStore.getAllAndLock ends up in a infinite loop if there's 
> an entry marked for deletion that's not cleaned up, yet
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26987
>                 URL: https://issues.apache.org/jira/browse/FLINK-26987
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> {{ZooKeeperStateHandleStore.getAllAndLock}} is used when recovering 
> {{CompletedCheckpoints}}. It iterates over all children and retries until it 
> reaches a stable and consistent version (i.e. no entries are subject to 
> deletion and no child nodes were added while accessing the ZK instance).
> Additionally, {{ZooKeeperStateHandleStore}} marks entries for deletion 
> internally before actually deleting them. This can lead to a state where an 
> entry is marked for deletion but the discard fails, causing the cleanup to 
> fail. The entry is left marked for deletion and another cleanup is 
> attempted. The cleanup can be retried indefinitely, but the user can limit 
> the number of retries. In that case, the entry might be left marked.
> Restarting the Flink cluster will then access this 
> {{ZooKeeperStateHandleStore}} to recover the checkpoints while the entry is 
> still marked for deletion, which causes an error in 
> [ZooKeeperStateHandleStore.getAllAndLock|https://github.com/apache/flink/blob/c3df4c3f1f868d40e1e70404bea41b7a007e8b08/flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java#L413]
>  which in turn results in an undesired retry loop.
> We actually don't need to retry in that case because the child can be 
> ignored, as far as I can see.
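
The difference between retrying on a marked entry and ignoring it can be
illustrated with a minimal, self-contained sketch. It is not the actual Flink
code: the class GetAllAndLockSketch, the Entry type, the retryOnMarkedEntry
flag and the maxAttempts bound are all hypothetical, ZooKeeper is replaced by
an in-memory map, and the version check for concurrently added children is
left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Flink.
final class GetAllAndLockSketch {

    /** Stand-in for a ZooKeeper child node that holds a state handle. */
    static final class Entry {
        final String state;
        final boolean markedForDeletion;

        Entry(String state, boolean markedForDeletion) {
            this.state = state;
            this.markedForDeletion = markedForDeletion;
        }
    }

    /**
     * Collects the state of all children.
     *
     * With retryOnMarkedEntry == true, an entry marked for deletion makes the
     * whole pass count as inconsistent, so the pass is repeated. A marked entry
     * that is never cleaned up then fails every pass. maxAttempts is an
     * assumption added here so that the sketch terminates.
     *
     * With retryOnMarkedEntry == false, the marked entry is skipped and the
     * pass succeeds.
     */
    static List<String> getAllAndLock(
            Map<String, Entry> children, boolean retryOnMarkedEntry, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<String> locked = new ArrayList<>();
            boolean consistent = true;

            for (Entry entry : children.values()) {
                if (entry.markedForDeletion) {
                    if (retryOnMarkedEntry) {
                        // Treat the marked entry as an inconsistency and retry.
                        consistent = false;
                        break;
                    }
                    // Ignore the marked entry and continue with the next child.
                    continue;
                }
                locked.add(entry.state);
            }

            if (consistent) {
                return locked;
            }
        }
        throw new IllegalStateException(
                "No consistent view after " + maxAttempts + " attempts");
    }
}
```

Given three children of which the second one is marked for deletion, calling
getAllAndLock with retryOnMarkedEntry set to false returns the states of the
first and third child, whereas setting it to true exhausts all attempts and
throws.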



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
