xianjingfeng commented on code in PR #938:
URL: https://github.com/apache/incubator-uniffle/pull/938#discussion_r1225710556
##########
coordinator/src/main/java/org/apache/uniffle/coordinator/SimpleClusterManager.java:
##########
@@ -121,35 +129,39 @@ public SimpleClusterManager(CoordinatorConf conf, Configuration hadoopConf) thro
void nodesCheck() {
try {
long timestamp = System.currentTimeMillis();
- Set<String> deleteIds = Sets.newHashSet();
- Set<String> unhealthyNode = Sets.newHashSet();
for (ServerNode sn : servers.values()) {
if (timestamp - sn.getTimestamp() > heartbeatTimeout) {
LOG.warn("Heartbeat timeout detect, " + sn + " will be removed from node list.");
- deleteIds.add(sn.getId());
+ sn.setStatus(ServerStatus.LOST);
+ lostNodes.add(sn);
+ unhealthyNodes.remove(sn);
} else if (!sn.isHealthy()) {
LOG.warn("Found server {} was unhealthy, will not assign it.", sn);
- unhealthyNode.add(sn.getId());
+ unhealthyNodes.add(sn);
Review Comment:
Should we remove the node from `lostNodes` after this line?
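
To illustrate the concern, here is a minimal, self-contained sketch of the check loop with the extra removal in place. It is not the PR's code: `NodeCheckSketch`, `Node`, `Status`, and the `check` method are hypothetical stand-ins for `SimpleClusterManager`, `ServerNode`, `ServerStatus`, and `nodesCheck()`, and it assumes that a node marked lost is still visited by later checks and that `lostNodes` and `unhealthyNodes` are plain sets. The final `else` branch that clears both sets for a healthy node is likewise an assumption, not something shown in the hunk.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for SimpleClusterManager; names are illustrative only.
public class NodeCheckSketch {
  enum Status { ACTIVE, UNHEALTHY, LOST }

  // Hypothetical stand-in for ServerNode (uses identity equality).
  static final class Node {
    final String id;
    long lastHeartbeat;
    boolean healthy;
    Status status = Status.ACTIVE;

    Node(String id, long lastHeartbeat, boolean healthy) {
      this.id = id;
      this.lastHeartbeat = lastHeartbeat;
      this.healthy = healthy;
    }
  }

  final Set<Node> lostNodes = new HashSet<>();
  final Set<Node> unhealthyNodes = new HashSet<>();
  final long heartbeatTimeout;

  NodeCheckSketch(long heartbeatTimeout) {
    this.heartbeatTimeout = heartbeatTimeout;
  }

  // Keeps the two sets mutually exclusive: a node is in at most one of them.
  void check(Iterable<Node> servers, long now) {
    for (Node sn : servers) {
      if (now - sn.lastHeartbeat > heartbeatTimeout) {
        sn.status = Status.LOST;
        lostNodes.add(sn);
        unhealthyNodes.remove(sn);
      } else if (!sn.healthy) {
        sn.status = Status.UNHEALTHY;
        unhealthyNodes.add(sn);
        // The removal asked about in the comment: the node is heartbeating
        // again, so it should no longer be reported as lost.
        lostNodes.remove(sn);
      } else {
        // Assumption: a healthy, heartbeating node belongs to neither set.
        sn.status = Status.ACTIVE;
        lostNodes.remove(sn);
        unhealthyNodes.remove(sn);
      }
    }
  }
}
```

Without the `lostNodes.remove(sn)` call in the unhealthy branch, a node that timed out and later resumed heartbeating in an unhealthy state would, under the assumptions above, appear in both sets at once.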
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]