XComp commented on a change in pull request #18869:
URL: https://github.com/apache/flink/pull/18869#discussion_r814118593
##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/zookeeper/ZooKeeperStateHandleStore.java
##########
@@ -375,7 +415,7 @@ public IntegerResourceVersion exists(String pathInZooKeeper) throws Exception {
             try {
                 final RetrievableStateHandle<T> stateHandle = getAndLock(path);
                 stateHandles.add(new Tuple2<>(stateHandle, path));
-            } catch (KeeperException.NoNodeException ignored) {
+            } catch (NotExistException ignored) {
Review comment:
Till's comment on that:
> I think the problem before was that when a JM lost its leadership then we only suspended it w/o releasing all resources. Now, the JM will shut down once it loses leadership. If it regains the leadership, then it will be restarted and read the state from ZooKeeper again. Due to this, I think the locking mechanism is probably no longer needed (assuming that no two leader sessions can overlap).
So that's somewhat unrelated to this bug. `getAllAndLock` is only used in the context of recovering checkpoints, which only happens during a failover, and in that case checkpoints are not deleted. The process that lost leadership would just release its locks on the checkpoints rather than delete them.
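For context, here is a minimal sketch of the skip-on-missing-node pattern that the `catch` above enables. This is not the actual Flink implementation; `GetAllAndLockSketch`, `HandleReader`, and the `NotExistException` constructor shown here are simplified stand-ins, with only the names `getAndLock` and `NotExistException` taken from the diff:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: read and lock every child node, skipping nodes that
// disappear between listing the children and reading the handle (the case
// covered by the NotExistException catch in the diff above).
public class GetAllAndLockSketch {

    /** Stand-in for the store's NotExistException (the node is already gone). */
    static class NotExistException extends Exception {
        NotExistException(String path) {
            super("Node " + path + " does not exist.");
        }
    }

    /** Stand-in for getAndLock(path) on the real state handle store. */
    interface HandleReader<T> {
        T getAndLock(String path) throws NotExistException;
    }

    public static <T> List<T> getAllAndLock(List<String> childPaths, HandleReader<T> reader) {
        final List<T> stateHandles = new ArrayList<>();
        for (String path : childPaths) {
            try {
                // May throw if the node vanished after the children were listed.
                stateHandles.add(reader.getAndLock(path));
            } catch (NotExistException ignored) {
                // Deleted concurrently; skip this node and continue with the rest.
            }
        }
        return stateHandles;
    }
}
```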
There is another code branch where old checkpoints are deleted when a new checkpoint is added. I haven't found the corresponding check, but I assume only the leader is able to create checkpoints, so there shouldn't be a clash.
But the fix is still needed to keep the exception handling consistent.