[ https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638611#comment-16638611 ]
Andrew Purtell edited comment on HBASE-21266 at 10/4/18 5:48 PM:
-----------------------------------------------------------------

bq. "Number of dead servers in processing should always be non-negative"

You are looking at that assert in DeadServer#finish, right? Asserts aren't evaluated unless the JVM is started with the -ea command line flag, which I didn't do. The log line I did see shows that the dead server map was empty at the time, so I agree we should look at the accounting in DeadServer.java.

"Not running balancer because processing dead regionserver(s)" is printed from HMaster.java:1846 based on the result from ServerManager#areDeadServersInProgress, which passes through the result from DeadServer#areDeadServersInProgress, which is simply

{code}
public synchronized boolean areDeadServersInProgress() {
  return processing;
}
{code}

This boolean is cleared in DeadServer#finish when

{code}
if (numProcessing == 0) {
  processing = false;
}
{code}

So the first question I have is why do we even need this boolean field? It can be derived cheaply from other state: in areDeadServersInProgress, just return the result of {{!(numProcessing == 0)}}. The assert you observed should be replaced by use of Preconditions so we will get a RuntimeException that will get noticed.

> Not running balancer because processing dead regionservers, but empty dead rs list
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-21266
>                 URL: https://issues.apache.org/jira/browse/HBASE-21266
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.4.8
>            Reporter: Andrew Purtell
>            Priority: Major
>             Fix For: 1.5.0, 1.4.9
>
>
> Found during ITBLL testing. AM in master gets into a state where manual attempts from the shell to run the balancer always return false and this is printed in the master log:
> 2018-10-03 19:17:14,892 DEBUG [RpcServer.default.FPBQ.Fifo.handler=21,queue=0,port=8100] master.HMaster: Not running balancer because processing dead regionserver(s):
> Note the empty list.
> This errant state did not recover without intervention by way of master restart, but the test environment was chaotic, so this needs investigation.
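
For illustration only, the change proposed in the comment (drop the separate boolean, derive the answer from numProcessing, and fail loudly instead of relying on an assert) might look roughly like the sketch below. This is not the actual DeadServer.java code: the class name DeadServerSketch and the simplified add/finish bodies are hypothetical, and a plain IllegalStateException stands in for Guava's Preconditions.checkState so the example runs without extra dependencies.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified stand-in for the accounting in DeadServer.java. */
class DeadServerSketch {
  // Servers currently known to be dead (simplified to plain names here).
  private final Set<String> deadServers = new HashSet<>();

  // Number of dead servers whose processing has started but not finished.
  private int numProcessing = 0;

  /** Records a dead server and marks its processing as started. */
  public synchronized void add(String serverName) {
    deadServers.add(serverName);
    numProcessing++;
  }

  /** Marks processing of one dead server as finished. */
  public synchronized void finish(String serverName) {
    // Unlike an assert, this check runs even when the JVM is started
    // without -ea, so broken accounting surfaces as a RuntimeException.
    if (numProcessing <= 0) {
      throw new IllegalStateException(
          "Number of dead servers in processing should always be non-negative");
    }
    numProcessing--;
  }

  /** Derived from the counter; there is no separate boolean to go stale. */
  public synchronized boolean areDeadServersInProgress() {
    return numProcessing != 0;
  }
}
```

With this shape, a finish call that has no matching add throws immediately rather than silently driving the counter negative, and areDeadServersInProgress cannot disagree with the counter it is computed from.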