[ 
https://issues.apache.org/jira/browse/FLINK-26284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495898#comment-17495898
 ] 

Matthias Pohl commented on FLINK-26284:
---------------------------------------

Several options were considered:
* Add a deletion flag as a child ZK node: This is problematic because it 
interferes with the transactional deletion of the already existing access lock 
child node. We must not delete the ZK node while another client is accessing 
it and holds a lock on it. Hence, we would have to check for the existence of 
the access lock child node, and delete everything if it doesn't exist or delete 
only the deletion-flag child node if it does.
* Alternatively, we could model the deletion-flag node on the same level as the 
StateHandle node. This makes it possible to run the transactional delete on the 
StateHandle node and clear the deletion-flag node afterwards.
* Move the deletion marker into the {{RetrievableStateHandle}}. This has the 
benefit that the {{KubernetesStateHandleStore}} (FLINK-26286) can use it as 
well. The flaw is that reading the marker requires deserialization, which 
fails if the data was corrupted. But in these cases, we want to delete the 
data anyway.
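
To make the intended ordering concrete, here is a minimal in-memory sketch of 
the second option (a deletion flag on the same level as the StateHandle node). 
It does not use ZooKeeper, Curator, or Flink classes. {{DeletionMarkerSketch}}, 
{{Handle}}, and the node path are hypothetical names for illustration only, and 
this {{releaseAndTryRemove}} is a simplified stand-in, not the actual 
{{ZooKeeperStateHandleStore}} implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** In-memory model only; it uses no ZooKeeper, Curator, or Flink classes. */
final class DeletionMarkerSketch {

    /** Hypothetical stand-in for a state handle whose discard can fail. */
    interface Handle {
        void discard() throws Exception;
    }

    // Models the StateHandle nodes: path -> handle metadata.
    private final Map<String, Handle> stateHandleNodes = new HashMap<>();
    // Models deletion-flag nodes kept on the same level as the StateHandle nodes.
    private final Set<String> deletionFlags = new HashSet<>();

    void add(String path, Handle handle) {
        stateHandleNodes.put(path, handle);
    }

    boolean exists(String path) {
        return stateHandleNodes.containsKey(path);
    }

    boolean isMarkedForDeletion(String path) {
        return deletionFlags.contains(path);
    }

    /** Returns true once state and metadata are gone; may be called again after a failure. */
    boolean releaseAndTryRemove(String path) {
        Handle handle = stateHandleNodes.get(path);
        if (handle == null) {
            // Nothing left to discard; clear a possibly leftover flag.
            deletionFlags.remove(path);
            return true;
        }
        // 1. Record the intent to delete before touching anything else.
        deletionFlags.add(path);
        try {
            // 2. Discard the state while the metadata is still readable.
            handle.discard();
        } catch (Exception e) {
            // Metadata and flag both stay in place, so a retry can find the handle.
            return false;
        }
        // 3. Only now remove the StateHandle node, then the flag.
        stateHandleNodes.remove(path);
        deletionFlags.remove(path);
        return true;
    }

    public static void main(String[] args) {
        DeletionMarkerSketch store = new DeletionMarkerSketch();
        int[] calls = {0};
        // The first discard attempt fails, the second succeeds.
        store.add("/jobgraphs/job-1", () -> {
            if (calls[0]++ == 0) {
                throw new Exception("simulated transient failure");
            }
        });
        System.out.println("first attempt removed: " + store.releaseAndTryRemove("/jobgraphs/job-1"));
        System.out.println("retry removed: " + store.releaseAndTryRemove("/jobgraphs/job-1"));
    }
}
```

The point of the sketch is only the order of the steps: because the metadata 
is removed last, a failed discard leaves enough information behind for a 
retry. Locking and the transactional delete are deliberately left out.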

> The ZooKeeperStateHandleStore cleans the metadata before cleaning the 
> StateHandle
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-26284
>                 URL: https://issues.apache.org/jira/browse/FLINK-26284
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Blocker
>
> Cleanup of job state does not work properly in an HA setup. 
> {{releaseAndTryRemove}} deletes the metadata stored in the store before 
> cleaning up the {{StateHandle}}. If the {{StateHandle}} cleanup fails after 
> the reference has already been deleted from the {{StateHandleStore}}, every 
> cleanup retry will fail because it cannot deserialize the {{StateHandle}} 
> anymore.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
