XComp commented on a change in pull request #18869:
URL: https://github.com/apache/flink/pull/18869#discussion_r814118593



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java
##########
@@ -375,7 +415,7 @@ public IntegerResourceVersion exists(String pathInZooKeeper) throws Exception {
                     try {
                         final RetrievableStateHandle<T> stateHandle = getAndLock(path);
                         stateHandles.add(new Tuple2<>(stateHandle, path));
-                    } catch (KeeperException.NoNodeException ignored) {
+                    } catch (NotExistException ignored) {

Review comment:
   Till's comment on that:
   > I think the problem before was that when a JM lost its leadership then we only suspended it w/o releasing all resources. Now, the JM will shut down once it loses leadership. If it regains the leadership, then it will be restarted and read the state from ZooKeeper again. Due to this, I think the locking mechanism is probably no longer needed (assuming that no two leader sessions can overlap).
   
   So that's kind of unrelated to this bug. `getAllAndLock` is only used in the context of recovering checkpoints. That only happens during a failover, in which case checkpoints are not deleted. The process that lost leadership would just release the checkpoints.
   
   There is another code branch where (old) checkpoints are deleted when adding a new checkpoint. I haven't found the check, but I assume only the leader is able to create checkpoints. Hence, there shouldn't be a clash.
   
   But the fix is still needed to make the exception handling consistent.
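   For illustration, here is a minimal, self-contained sketch (hypothetical stand-ins, not the actual `ZooKeeperStateHandleStore` code) of the pattern the diff touches: while iterating over the known paths, an entry that vanished between listing and retrieval is simply skipped instead of failing the whole recovery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of the "tolerate concurrently deleted entries" pattern.
 * TolerantStoreSketch, its in-memory map, and the nested NotExistException
 * are hypothetical stand-ins; only the shape of the getAndLock/getAllAndLock
 * loop mirrors the code in the diff.
 */
public class TolerantStoreSketch {

    /** Hypothetical stand-in for the store's NotExistException. */
    static class NotExistException extends Exception {
        NotExistException(String path) {
            super("Node " + path + " does not exist.");
        }
    }

    private final Map<String, String> nodes = new ConcurrentHashMap<>();

    void put(String path, String state) {
        nodes.put(path, state);
    }

    /** Hypothetical stand-in for getAndLock(path). */
    String getAndLock(String path) throws NotExistException {
        String state = nodes.get(path);
        if (state == null) {
            throw new NotExistException(path);
        }
        return state;
    }

    /** Hypothetical stand-in for getAllAndLock(). */
    List<String> getAllAndLock() {
        List<String> stateHandles = new ArrayList<>();
        for (String path : nodes.keySet()) {
            try {
                stateHandles.add(getAndLock(path));
            } catch (NotExistException ignored) {
                // The entry was removed between listing and retrieval;
                // skip it instead of failing the whole recovery, mirroring
                // how the fixed code ignores NotExistException.
            }
        }
        return stateHandles;
    }

    public static void main(String[] args) {
        TolerantStoreSketch store = new TolerantStoreSketch();
        store.put("/checkpoints/1", "handle-1");
        store.put("/checkpoints/2", "handle-2");
        System.out.println(store.getAllAndLock());
    }
}
```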



