[
https://issues.apache.org/jira/browse/FLINK-26284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496221#comment-17496221
]
Matthias Pohl commented on FLINK-26284:
---------------------------------------
We still had trouble running the check for locks and the mark-for-deletion
step atomically. Therefore, [~dmvk] proposed a different solution, which worked
in the end: a {{/locks}} subfolder was introduced that holds all the locks of
open ZK connections. Deleting this subfolder marks the node as ready for
deletion. The delete fails if any connection still holds a lock, because
ZooKeeper refuses to delete a node that still has children.
This approach lets us perform the connection check and the mark-for-deletion
operation as a single atomic step.
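
The {{/locks}} idea can be illustrated with a small in-memory model. This is
only a sketch, not the Flink implementation: the class, its method names and
the example paths are hypothetical, and a sorted set of path strings stands in
for the ZooKeeper tree. It assumes two ZooKeeper behaviours: a node with
children cannot be deleted, and a child cannot be created under a missing
parent.

```java
import java.util.TreeSet;

/** Toy stand-in for a znode tree. Hypothetical names; not Flink or ZooKeeper code. */
final class LocksSubfolderSketch {
    // All existing node paths, e.g. "/jobgraphs/job-1/locks/conn-a".
    private final TreeSet<String> nodes = new TreeSet<>();

    /** Registers a state entry together with its (initially empty) locks subfolder. */
    synchronized void addEntry(String entry) {
        nodes.add(entry);
        nodes.add(entry + "/locks");
    }

    /** Takes a lock for the given owner; refused once the locks subfolder is gone. */
    synchronized boolean lock(String entry, String owner) {
        if (!nodes.contains(entry + "/locks")) {
            return false; // parent missing: the entry is already marked for deletion
        }
        nodes.add(entry + "/locks/" + owner);
        return true;
    }

    /** Releases the lock held by the given owner, if any. */
    synchronized void release(String entry, String owner) {
        nodes.remove(entry + "/locks/" + owner);
    }

    /** Deletes a node only if it exists and has no children. */
    private boolean deleteIfChildless(String path) {
        String prefix = path + "/";
        for (String node : nodes) {
            if (node.startsWith(prefix)) {
                return false; // a child exists, so the delete is rejected
            }
        }
        return nodes.remove(path);
    }

    /** Lock check and mark-for-deletion in one step: delete the locks subfolder. */
    synchronized boolean tryMarkForDeletion(String entry) {
        return deleteIfChildless(entry + "/locks");
    }

    /** An entry counts as marked once its locks subfolder no longer exists. */
    synchronized boolean isMarkedForDeletion(String entry) {
        return nodes.contains(entry) && !nodes.contains(entry + "/locks");
    }
}
```

In this model, {{tryMarkForDeletion}} returns false while any lock node exists
and true once the last lock has been released. After that, {{lock}} is refused,
so no new connection can slip in between the check and the mark.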
> The ZooKeeperStateHandleStore cleans the metadata before cleaning the
> StateHandle
> ---------------------------------------------------------------------------------
>
> Key: FLINK-26284
> URL: https://issues.apache.org/jira/browse/FLINK-26284
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Matthias Pohl
> Assignee: Matthias Pohl
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.15.0
>
>
> Cleanup of job state does not work properly in an HA setup.
> {{releaseAndTryRemove}} deletes the metadata stored in the store before
> cleaning up the {{StateHandle}}. If the {{StateHandle}} cleanup fails after
> the reference has already been deleted from the {{StateHandleStore}}, every
> cleanup retry will fail because it can no longer deserialize the
> {{StateHandle}}.
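
The ordering problem described in the issue can be reproduced with a minimal
model. This is a sketch under stated assumptions, not Flink code: the class,
fields and method names are hypothetical, two maps stand in for the metadata
store and the state storage, and a flag simulates a failing state cleanup. The
second method shows one ordering that keeps a retry possible. It is not claimed
to be the fix that was merged.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the cleanup order. Hypothetical names; not Flink code. */
final class CleanupOrderSketch {
    /** Metadata store: entry name -> pointer to the stored state. */
    final Map<String, String> metadata = new HashMap<>();
    /** Stand-in for the storage that holds the actual state, keyed by pointer. */
    final Map<String, String> stateStorage = new HashMap<>();
    /** When true, discarding state fails (simulated storage outage). */
    boolean storageFailing = false;

    void put(String name, String pointer, String state) {
        metadata.put(name, pointer);
        stateStorage.put(pointer, state);
    }

    private void discardState(String pointer) {
        if (storageFailing) {
            throw new IllegalStateException("storage unavailable");
        }
        stateStorage.remove(pointer);
    }

    /** Problematic order: the metadata is removed before the state is discarded. */
    void removeMetadataFirst(String name) {
        String pointer = metadata.remove(name);
        if (pointer == null) {
            // On a retry the reference is gone, so the state cannot be located.
            throw new IllegalStateException("no metadata left for " + name);
        }
        discardState(pointer);
    }

    /** Retry-friendly order: the metadata is removed only after the state is gone. */
    void removeStateFirst(String name) {
        String pointer = metadata.get(name);
        if (pointer == null) {
            return; // nothing left to clean up
        }
        discardState(pointer); // if this throws, the metadata survives for a retry
        metadata.remove(name);
    }
}
```

With {{removeMetadataFirst}}, a failed state cleanup leaves the state orphaned
and every retry fails. With {{removeStateFirst}}, the reference is still
present after a failure, so a later retry can finish the cleanup.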