LB-Yu commented on code in PR #1401:
URL: https://github.com/apache/fluss/pull/1401#discussion_r2281474715


##########
fluss-server/src/main/java/com/alibaba/fluss/server/coordinator/CoordinatorServer.java:
##########
@@ -157,10 +158,45 @@ public static void main(String[] args) {
 
     @Override
     protected void startServices() throws Exception {
+        electCoordinatorLeader();
+    }
+
+    private void electCoordinatorLeader() throws Exception {
+        this.zkClient = ZooKeeperUtils.startZookeeperClient(conf, this);
+
+        // Coordinator Server supports high availability. If 3 coordinator 
servers are alive,
+        // one of them will be elected as leader and the other two will be 
standby.
+        // When leader fails, one of standby coordinators will be elected as 
new leader.
+        // All of them register to ZK like tablet servers in path
+        // "/coordinators/ids/1","/coordinators/ids/2","/coordinators/ids/3".
+        // but the leader will be elected in path "/coordinators/leader" 
additionally.
+        registerCoordinatorServer();
+        ZooKeeperUtils.registerZookeeperClientReInitSessionListener(
+                zkClient, this::registerCoordinatorServer, this);
+
+        // try to register Coordinator leader once
+        if (tryElectCoordinatorLeaderOnce()) {
+            startCoordinatorLeaderService();
+        } else {
+            // standby
+            CoordinatorLeaderElection coordinatorLeaderElection =
+                    new CoordinatorLeaderElection(zkClient.getCuratorClient(), 
serverId);
+            coordinatorLeaderElection.startElectLeader(
+                    () -> {
+                        try {
+                            startCoordinatorLeaderService();
+                        } catch (Exception e) {
+                            throw new RuntimeException(e);
+                        }
+                    });
+        }

Review Comment:
   This logic looks strange to me. Why do we explicitly create the path for the 
first election, but use `LeaderLatch` for the standby Coordinator? Since we’re 
already using the framework, perhaps we could standardize on `LeaderLatch`?
   
   Also, this logic seems problematic. If cs-1 becomes the leader on the first 
startup and later loses leadership due to a network issue, it appears it will 
never participate in the election again, because no `LeaderLatch` was 
registered for the node that became leader during the first startup.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to