[ 
https://issues.apache.org/jira/browse/HBASE-21380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663954#comment-16663954
 ] 

Mike Drob commented on HBASE-21380:
-----------------------------------

Continuing to look into this...

Here's a snippet from my logs:

{noformat}
2018-10-25 10:34:18,231 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] procedure2.ProcedureExecutor(522): Completed pid=9, state=SUCCESS; ServerCrashProcedure server=mdrob-mbp.hsd1.tx.comcast.net,60040,1540481628186, splitWal=true, meta=true
2018-10-25 10:34:18,232 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] procedure2.ProcedureExecutor(522): Completed pid=10, state=SUCCESS; ServerCrashProcedure server=mdrob-mbp.hsd1.tx.comcast.net,60037,1540481628106, splitWal=true, meta=false
2018-10-25 10:34:18,232 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] procedure2.ProcedureExecutor(522): Completed pid=11, state=SUCCESS; ServerCrashProcedure server=mdrob-mbp.hsd1.tx.comcast.net,60111,1540481641705, splitWal=true, meta=false
2018-10-25 10:34:18,232 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] procedure2.ProcedureExecutor(522): Completed pid=12, state=SUCCESS; ServerCrashProcedure server=mdrob-mbp.hsd1.tx.comcast.net,60034,1540481627993, splitWal=true, meta=false
2018-10-25 10:34:18,233 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] procedure2.ProcedureExecutor(522): Completed pid=16, state=SUCCESS; CreateTableProcedure table=hbase:quota
2018-10-25 10:34:18,233 INFO  [master/mdrob-mbp:0:becomeActiveMaster] procedure2.ProcedureExecutor(729): Loaded WALProcedureStore in 14msec
2018-10-25 10:34:18,233 INFO  [master/mdrob-mbp:0:becomeActiveMaster] procedure2.RemoteProcedureDispatcher(97): Instantiated, coreThreads=128 (allowCoreThreadTimeOut=true), queueMaxSize=32, operationDelay=150
2018-10-25 10:34:18,235 INFO  [master/mdrob-mbp:0:becomeActiveMaster] master.RegionServerTracker(123): Starting RegionServerTracker; 4 have existing ServerCrashProcedures, 3 possibly 'live' servers, and 0 'splitting'.
2018-10-25 10:34:18,236 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] master.DeadServer(136): Added mdrob-mbp.hsd1.tx.comcast.net,60040,1540481628186; numProcessing=1
java.lang.Exception: trace
2018-10-25 10:34:18,236 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] master.DeadServer(139): Added mdrob-mbp.hsd1.tx.comcast.net,60037,1540481628106; numProcessing=2
2018-10-25 10:34:18,236 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] master.DeadServer(139): Added mdrob-mbp.hsd1.tx.comcast.net,60111,1540481641705; numProcessing=3
2018-10-25 10:34:18,236 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] master.DeadServer(139): Added mdrob-mbp.hsd1.tx.comcast.net,60034,1540481627993; numProcessing=4
2018-10-25 10:34:18,237 DEBUG [master/mdrob-mbp:0:becomeActiveMaster] master.DeadServer(139): Added mdrob-mbp.hsd1.tx.comcast.net,60145,1540481644975; numProcessing=5
{noformat}

The line numbers for the DeadServer log messages are a bit off from what's in 
branch-2.1 because I added some additional checks around there. The trace for 
those calls is:

{noformat}
        at org.apache.hadoop.hbase.master.DeadServer.add(DeadServer.java:137)
        at java.lang.Iterable.forEach(Iterable.java:75)
        at org.apache.hadoop.hbase.master.ServerManager.findDeadServersAndProcess(ServerManager.java:321)
        at org.apache.hadoop.hbase.master.RegionServerTracker.start(RegionServerTracker.java:145)
        at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:904)
        at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2254)
        at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:583)
        at java.lang.Thread.run(Thread.java:748)
{noformat}

The really interesting part is that the RegionServerTracker starts with 
a list of servers derived from the current set of ServerCrashProcedures 
(SCPs). But if those SCPs are already done, they'll never call finish() on 
the dead servers list, so the servers are never removed from the 
"processing" list, and then even a user request won't clear them out of 
dead server space.
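
To make that sequence concrete, here is a minimal sketch of the bookkeeping as I understand it. This is a simplified, hypothetical model and not the actual HBase code: the class name DeadServerModel, the methods replayStartup() and clear(), and the per-server check in clear() are all assumptions made for illustration. Only the add()/finish() pairing and the "processing" list come from the behavior described above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the dead-server bookkeeping.
// This is NOT org.apache.hadoop.hbase.master.DeadServer.
class DeadServerModel {
    private final Set<String> dead = new HashSet<>();
    private final Set<String> processing = new HashSet<>();

    // Models what happens at master startup for each server that has an SCP
    // in the procedure store: the server is marked dead and "processing".
    void add(String serverName) {
        dead.add(serverName);
        processing.add(serverName);
    }

    // Models the call a running SCP makes when it completes.
    void finish(String serverName) {
        processing.remove(serverName);
    }

    // Models a user request to clear a dead server (assumed behavior):
    // the request is refused while the server is still "processing".
    boolean clear(String serverName) {
        if (processing.contains(serverName)) {
            return false;
        }
        return dead.remove(serverName);
    }

    int numProcessing() {
        return processing.size();
    }

    // Models the startup path: servers are added from the set of SCPs, but
    // SCPs that are already in state SUCCESS are not executed again, so
    // nothing ever calls finish() for them.
    static DeadServerModel replayStartup(List<String> serversWithCompletedScp) {
        DeadServerModel model = new DeadServerModel();
        serversWithCompletedScp.forEach(model::add);
        return model;
    }
}
```

In this model, every server passed to replayStartup() stays in the "processing" set, and clear() keeps returning false for it until something calls finish(), which matches the stuck state seen in the logs above.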

> TestRSGroups failing
> --------------------
>
>                 Key: HBASE-21380
>                 URL: https://issues.apache.org/jira/browse/HBASE-21380
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.1.1
>            Reporter: Sean Busbey
>            Assignee: Mike Drob
>            Priority: Major
>
> only failing on branch-2.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
