dlmarion commented on code in PR #6049:
URL: https://github.com/apache/accumulo/pull/6049#discussion_r2682297045


##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -1190,15 +1194,31 @@ private List<TabletMigration> 
checkMigrationSanity(Set<TabletServerId> current,
               > MAX_BAD_STATUS_COUNT) {
             if (shutdownServerRateLimiter.tryAcquire()) {
               log.warn("attempting to stop {}", server);
+              if (forceHaltingEnabled
+                  && (haltedServers.computeIfAbsent(server, s -> new 
AtomicInteger(0))
+                      .incrementAndGet() > maxTserverHalts)) {
+                log.warn("tserver {} is not responding to halt requests, 
deleting zlock", server);
+                var zk = getContext().getZooReaderWriter();
+                var iid = getContext().getInstanceID();
+                String tserversPath = Constants.ZROOT + "/" + iid + 
Constants.ZTSERVERS;
+                try {
+                  ServiceLock.deleteLocks(zk, tserversPath, 
server.getHostAndPort()::equals,
+                      log::info, false);
+                } catch (KeeperException | InterruptedException e) {
+                  log.error("Failed to delete zlock for server {}", server);

Review Comment:
   Would probably be useful to have the exception in the log as well.
   
   
   ```suggestion
                     log.error("Failed to delete zlock for server {}", server, 
e);
   ```



##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -1190,15 +1194,31 @@ private List<TabletMigration> 
checkMigrationSanity(Set<TabletServerId> current,
               > MAX_BAD_STATUS_COUNT) {
             if (shutdownServerRateLimiter.tryAcquire()) {
               log.warn("attempting to stop {}", server);
+              if (forceHaltingEnabled
+                  && (haltedServers.computeIfAbsent(server, s -> new 
AtomicInteger(0))
+                      .incrementAndGet() > maxTserverHalts)) {
+                log.warn("tserver {} is not responding to halt requests, 
deleting zlock", server);
+                var zk = getContext().getZooReaderWriter();
+                var iid = getContext().getInstanceID();
+                String tserversPath = Constants.ZROOT + "/" + iid + 
Constants.ZTSERVERS;
+                try {
+                  ServiceLock.deleteLocks(zk, tserversPath, 
server.getHostAndPort()::equals,
+                      log::info, false);
+                } catch (KeeperException | InterruptedException e) {
+                  log.error("Failed to delete zlock for server {}", server);
+                }
+                haltedServers.remove(server);
+              }

Review Comment:
   Wondering if this close should be an `else` clause instead. Without the 
`else`, this will continue and try to execute the `halt` RPC after deleting the 
lock.



##########
core/src/main/java/org/apache/accumulo/core/conf/Property.java:
##########
@@ -450,6 +450,10 @@ public enum Property {
       "The number of threads used to run fault-tolerant executions (FATE)."
           + " These are primarily table operations like merge.",
       "1.4.3"),
+  MANAGER_MAX_TSERVER_HALTS("manager.max.tservers.halts", "0", 
PropertyType.COUNT,
+      "Allows the manager to force tserver halting by setting the max number 
of attempted tserver halt "
+          + " requests before deleting the tserver's zlock. A value of zero 
(default) disables this feature.",
+      "2.1.5"),

Review Comment:
   You don't have to apply this suggestion, was just thinking of different 
naming
   
   ```suggestion
     MANAGER_TSERVER_HALT_ATTEMPTS("manager.tservers.halt.attempts", "0", 
PropertyType.COUNT,
         "Allows the manager to remove the tserver lock in zookeeper after this 
many failed RPC halt attempts. A value of zero (default) disables this 
feature.",
         "2.1.5"),
   ```



##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -195,6 +195,8 @@ public class Manager extends AbstractServer implements 
LiveTServerSet.Listener,
   final AuditedSecurityOperation security;
   final Map<TServerInstance,AtomicInteger> badServers =
       Collections.synchronizedMap(new HashMap<>());
+  final Map<TServerInstance,AtomicInteger> haltedServers =

Review Comment:
   ```suggestion
     final Map<TServerInstance,AtomicInteger> tserverHaltRpcAttempts =
   ```



##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -1150,6 +1152,8 @@ private List<TabletMigration> 
checkMigrationSanity(Set<TabletServerId> current,
     long start = System.currentTimeMillis();
     final SortedMap<TServerInstance,TabletServerStatus> result = new 
ConcurrentSkipListMap<>();
     final RateLimiter shutdownServerRateLimiter = 
RateLimiter.create(MAX_SHUTDOWNS_PER_SEC);
+    final int maxTserverHalts = 
getConfiguration().getCount(Property.MANAGER_MAX_TSERVER_HALTS);

Review Comment:
   ```suggestion
       final int maxTserverRpcHaltAttempts = 
getConfiguration().getCount(Property.MANAGER_MAX_TSERVER_HALTS);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to