tillrohrmann commented on a change in pull request #13725:
URL: https://github.com/apache/flink/pull/13725#discussion_r509935113
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java
##########
@@ -203,6 +207,11 @@ protected void handleStateChange(ConnectionState newState)
{
}
}
+ private void onReconnectedConnectionState() {
+ // check whether we find some new leader information in
ZooKeeper
+ retrieveLeaderInformationFromZooKeeper();
Review comment:
I think you are right that in some cases we will report a stale leader
here. However, once the asynchronous fetch from the `NodeCache` completes, the
listener should be notified about the new leader. What happens in the meantime
is that the listener will try to connect to the old leader which should either
be gone or reject all connection attempts since he is no longer the leader.
The problem `notifyLossOfLeader` tried to solve is that a listener thinks
that a stale leader is still the leader and, thus, continues working for it w/o
questioning it (e.g. check with the leader) until the connection to the leader
times out. With the `notifyLossOfLeader` change, once the retrieval service
loses connection to ZooKeeper, it will tell the listener that the current
leader is no longer valid. This will tell the listener to stop working for this
leader (e.g. cancelling all tasks, disconnecting from it, etc.). If the
listener should shortly after be told that the old leader is still the leader
because of stale information, then it will first try to connect to the leader
which will fail (assuming that the old leader is indeed no longer the leader)
before it can start doing work for the leader.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]