dlmarion commented on code in PR #6049:
URL: https://github.com/apache/accumulo/pull/6049#discussion_r2682297045
##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -1190,15 +1194,31 @@ private List<TabletMigration>
checkMigrationSanity(Set<TabletServerId> current,
> MAX_BAD_STATUS_COUNT) {
if (shutdownServerRateLimiter.tryAcquire()) {
log.warn("attempting to stop {}", server);
+ if (forceHaltingEnabled
+ && (haltedServers.computeIfAbsent(server, s -> new
AtomicInteger(0))
+ .incrementAndGet() > maxTserverHalts)) {
+ log.warn("tserver {} is not responding to halt requests,
deleting zlock", server);
+ var zk = getContext().getZooReaderWriter();
+ var iid = getContext().getInstanceID();
+ String tserversPath = Constants.ZROOT + "/" + iid +
Constants.ZTSERVERS;
+ try {
+ ServiceLock.deleteLocks(zk, tserversPath,
server.getHostAndPort()::equals,
+ log::info, false);
+ } catch (KeeperException | InterruptedException e) {
+ log.error("Failed to delete zlock for server {}", server);
Review Comment:
Would probably be useful to have the exception in the log as well.
```suggestion
log.error("Failed to delete zlock for server {}", server,
e);
```
##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -1190,15 +1194,31 @@ private List<TabletMigration>
checkMigrationSanity(Set<TabletServerId> current,
> MAX_BAD_STATUS_COUNT) {
if (shutdownServerRateLimiter.tryAcquire()) {
log.warn("attempting to stop {}", server);
+ if (forceHaltingEnabled
+ && (haltedServers.computeIfAbsent(server, s -> new
AtomicInteger(0))
+ .incrementAndGet() > maxTserverHalts)) {
+ log.warn("tserver {} is not responding to halt requests,
deleting zlock", server);
+ var zk = getContext().getZooReaderWriter();
+ var iid = getContext().getInstanceID();
+ String tserversPath = Constants.ZROOT + "/" + iid +
Constants.ZTSERVERS;
+ try {
+ ServiceLock.deleteLocks(zk, tserversPath,
server.getHostAndPort()::equals,
+ log::info, false);
+ } catch (KeeperException | InterruptedException e) {
+ log.error("Failed to delete zlock for server {}", server);
+ }
+ haltedServers.remove(server);
+ }
Review Comment:
Wondering if this close should be an `else` clause instead. Without the
`else`, this will continue and try to execute the `halt` RPC after deleting the
lock.
##########
core/src/main/java/org/apache/accumulo/core/conf/Property.java:
##########
@@ -450,6 +450,10 @@ public enum Property {
"The number of threads used to run fault-tolerant executions (FATE)."
+ " These are primarily table operations like merge.",
"1.4.3"),
+ MANAGER_MAX_TSERVER_HALTS("manager.max.tservers.halts", "0",
PropertyType.COUNT,
+ "Allows the manager to force tserver halting by setting the max number
of attempted tserver halt "
+ + " requests before deleting the tserver's zlock. A value of zero
(default) disables this feature.",
+ "2.1.5"),
Review Comment:
You don't have to apply this suggestion, was just thinking of different
naming
```suggestion
MANAGER_TSERVER_HALT_ATTEMPTS("manager.tservers.halt.attempts", "0",
PropertyType.COUNT,
"Allows the manager to remove the tserver lock in zookeeper after this
many failed RPC halt attempts. A value of zero (default) disables this
feature.",
"2.1.5"),
```
##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -195,6 +195,8 @@ public class Manager extends AbstractServer implements
LiveTServerSet.Listener,
final AuditedSecurityOperation security;
final Map<TServerInstance,AtomicInteger> badServers =
Collections.synchronizedMap(new HashMap<>());
+ final Map<TServerInstance,AtomicInteger> haltedServers =
Review Comment:
```suggestion
final Map<TServerInstance,AtomicInteger> tserverHaltRpcAttempts =
```
##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -1150,6 +1152,8 @@ private List<TabletMigration>
checkMigrationSanity(Set<TabletServerId> current,
long start = System.currentTimeMillis();
final SortedMap<TServerInstance,TabletServerStatus> result = new
ConcurrentSkipListMap<>();
final RateLimiter shutdownServerRateLimiter =
RateLimiter.create(MAX_SHUTDOWNS_PER_SEC);
+ final int maxTserverHalts =
getConfiguration().getCount(Property.MANAGER_MAX_TSERVER_HALTS);
Review Comment:
```suggestion
final int maxTserverRpcHaltAttempts =
getConfiguration().getCount(Property.MANAGER_MAX_TSERVER_HALTS);
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]