xiaolailong commented on issue #13913: URL: https://github.com/apache/dolphinscheduler/issues/13913#issuecomment-1818989174
@Radeity Hi. As you mentioned above, when a reconnect happens, the master cannot find itself because its heartbeat information is set to empty in ZooKeeper. I cannot reproduce this bug. As far as I can see, `MasterHeartBeatTask.java` updates the heartbeat information every 10s, so it should not stay empty all the time. I also hit this bug in our production environment, so I tried to reproduce it, but failed. Can you give me some help? Thanks!

> Hi, @minyk, in `MasterConnectionStateListener` of version 3.0.x, when the connection state changes to `RECONNECTED`, the master node is removed and a new one is created.
>
> https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/MasterConnectionStateListener.java#L50-L54
>
> However, when creating the new ephemeral node, we don't set the heartBeat JSON as its initial value, like
> ```java
> registryClient.persistEphemeral(masterRegistryPath, JSONUtils.toJsonString(masterHeartBeatTask.getHeartBeat()));
> ```
>
> Information about master nodes is only updated when handling node add and remove events in `ServerNodeManager`.
>
> https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/ServerNodeManager.java#L313-L329
>
> **In `getServerList` of version 3.0.x, if we don't get heartBeat info, we skip this node.**
>
> https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/registry/RegistryClient.java#L94-L103
>
> Thus, when master2 executes `syncMasterNodes`, it cannot find itself in `masterPriorityQueue`. The master node information will not be updated any more, so master2 will keep writing the warning message.
>
> https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/ServerNodeManager.java#L356-L363
>
> You can try to update your DS version to 3.1.x, where we provide a stop/waiting strategy; this bug doesn't exist there :D

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
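One possible reading of the quoted explanation, which may help when trying to reproduce the issue: the registry value itself does not stay empty (the periodic heartbeat fills it in), but the cached master list is rebuilt only on node add/remove events, so it can stay stale. The sketch below is a plain-Java simulation of that reading and not DolphinScheduler code. Every name in it (`StaleMasterListSketch`, `createNode`, `writeHeartBeat`, and so on) is hypothetical, and the assumption that a heartbeat data update triggers no cache refresh is taken from the quoted comment rather than verified against the 3.0.x source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StaleMasterListSketch {
    // Hypothetical stand-in for the registry: node path -> stored value.
    static final Map<String, String> registry = new LinkedHashMap<>();
    // Hypothetical stand-in for the cached list of known masters.
    static List<String> cachedMasters = new ArrayList<>();

    // Mirrors the quoted rule: a node without heartbeat info is skipped.
    static List<String> getServerList() {
        List<String> servers = new ArrayList<>();
        for (Map.Entry<String, String> entry : registry.entrySet()) {
            String heartBeat = entry.getValue();
            if (heartBeat == null || heartBeat.isEmpty()) {
                continue; // no heartbeat info, so the node is ignored
            }
            servers.add(entry.getKey());
        }
        return servers;
    }

    // Assumption: the cache is rebuilt only on add/remove events.
    static void onNodeAddedOrRemoved() {
        cachedMasters = getServerList();
    }

    static void createNode(String path, String initialValue) {
        registry.put(path, initialValue);
        onNodeAddedOrRemoved();
    }

    static void removeNode(String path) {
        registry.remove(path);
        onNodeAddedOrRemoved();
    }

    // Assumption: a heartbeat write is a data update that fires no add/remove event.
    static void writeHeartBeat(String path, String json) {
        registry.put(path, json);
    }

    public static void main(String[] args) {
        createNode("master1", "{\"cpu\":0.1}");
        createNode("master2", "{\"cpu\":0.2}");
        System.out.println(cachedMasters); // [master1, master2]

        // Simulated RECONNECTED handling: remove, then re-create with an empty value.
        removeNode("master2");
        createNode("master2", "");

        // The periodic heartbeat later fills in the value...
        writeHeartBeat("master2", "{\"cpu\":0.3}");
        System.out.println(registry.get("master2").isEmpty()); // false

        // ...but the cache was built while the value was still empty.
        System.out.println(cachedMasters.contains("master2")); // false
    }
}
```

If this reading is right, a reproduction attempt would need a real session reconnect on the master plus no later master add/remove event to refresh the list. Inspecting the node value in ZooKeeper afterwards would look normal, because the next heartbeat has already overwritten the empty value.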
