[ https://issues.apache.org/jira/browse/HBASE-20644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489951#comment-16489951 ]
Ted Yu commented on HBASE-20644:
--------------------------------

The master was waiting for a region server to check in:
{code}
2018-05-23 21:54:27,893 WARN [master/ctr-e138-1518143905142-329221-01-000003:20000] assignment.AssignmentManager: No servers available; cannot place 1 unassigned regions.
2018-05-23 21:54:28,877 INFO [master/ctr-e138-1518143905142-329221-01-000003:20000] master.ServerManager: Waiting on regionserver count=0; waited=42119ms, expecting min=1 server(s), max=NO_LIMIT server(s), timeout=30000ms, lastChange=-42119ms
{code}
Finally, when 002 checked in, the master started to assign regions to it:
{code}
2018-05-23 21:54:28,984 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=20000] master.ServerManager: Registering regionserver=ctr-e138-1518143905142-329221-01-000002.hwx.site,16020,1527112463065
2018-05-23 21:54:29,033 INFO [master/ctr-e138-1518143905142-329221-01-000003:20000] master.ServerManager: Waiting on regionserver count=1; waited=42274ms, expecting min=1 server(s), max=NO_LIMIT server(s), timeout=30000ms, lastChange=0ms
{code}
Here is the related code:
{code}
for (ServerName serverName: offlineServersWithOnlineRegions) {
  if (!master.getServerManager().isServerOnline(serverName)) {
    LOG.info("KILL RegionServer=" + serverName + " hosting regions but not online.");
    killRegionServer(serverName);
{code}
This is how the AssignmentManager (AM) handles servers that are not online but are still recorded as hosting regions.

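As a rough illustration of the check quoted above, here is a minimal, self-contained sketch. It is not HBase code: the class `OfflineServerCheck` and the method `offlineServersWithRegions` are hypothetical, and it assumes a server is identified by its full host,port,startcode string as printed in the logs, so that a restarted instance on the same host counts as a different server.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; this is NOT the HBase AssignmentManager implementation.
final class OfflineServerCheck {

    // Hypothetical helper: returns every server that is recorded as hosting a region
    // but is absent from the set of online servers. Servers are identified by the
    // full "host,port,startcode" string (an assumption modelled on the log output).
    static Set<String> offlineServersWithRegions(Map<String, String> regionLocations,
                                                 Set<String> onlineServers) {
        Set<String> offline = new LinkedHashSet<>();
        for (String location : regionLocations.values()) {
            if (!onlineServers.contains(location)) {
                offline.add(location);
            }
        }
        return offline;
    }

    public static void main(String[] args) {
        // Server identities taken from the log excerpts in this issue.
        String host007 = "ctr-e138-1518143905142-329221-01-000007.hwx.site";
        String old007 = host007 + ",16020,1527112194788";
        String new007 = host007 + ",16020,1527112506002";
        String rs002 = "ctr-e138-1518143905142-329221-01-000002.hwx.site,16020,1527112463065";

        // hbase:meta still records the namespace region on the old 007 instance.
        Map<String, String> regionLocations = new LinkedHashMap<>();
        regionLocations.put("01a7f9ba9fffd691f261d3fbc620da06", old007);

        // Only 002 has checked in at this point.
        Set<String> online = new LinkedHashSet<>();
        online.add(rs002);
        System.out.println(offlineServersWithRegions(regionLocations, online));

        // A new instance on the same host registers later, with a different start code.
        online.add(new007);
        System.out.println(offlineServersWithRegions(regionLocations, online));
    }
}
```

In this sketch the old 007 instance (start code 1527112194788) is still reported as offline after the new 007 instance (start code 1527112506002) registers, because the two identities differ in their start code.
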
> Master shutdown due to service ClusterSchemaServiceImpl failing to start
> ------------------------------------------------------------------------
>
>                 Key: HBASE-20644
>                 URL: https://issues.apache.org/jira/browse/HBASE-20644
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Romil Choksi
>            Priority: Major
>         Attachments: 101383-master-ctr-e138-1518143905142-329221-01-000003.hwx.site.log,
>                      101383-regionserver-ctr-e138-1518143905142-329221-01-000002.hwx.site.log,
>                      101383-regionserver-ctr-e138-1518143905142-329221-01-000007.hwx.site.log
>
>
> From hbase-hbase-master-ctr-e138-1518143905142-329221-01-000003.hwx.site.log:
> {code}
> 2018-05-23 22:14:29,750 ERROR [master/ctr-e138-1518143905142-329221-01-000003:20000] master.HMaster: Failed to become active master
> java.lang.IllegalStateException: Expected the service ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service has FAILED
>   at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345)
>   at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291)
>   at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1054)
>   at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:918)
>   at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2023)
> {code}
> Earlier in the log, the namespace region, 01a7f9ba9fffd691f261d3fbc620da06, was deemed OPEN on 01-000007.hwx.site,16020,1527112194788 which was declared not online:
> {code}
> 2018-05-23 21:54:34,786 INFO [master/ctr-e138-1518143905142-329221-01-000003:20000] assignment.RegionStateStore: Load hbase:meta entry region=01a7f9ba9fffd691f261d3fbc620da06, regionState=OPEN, lastHost=ctr-e138-1518143905142-329221-01-000007.hwx.site,16020,1527112194788, regionLocation=ctr-e138-1518143905142-329221-01-000007.hwx.site,16020,1527112194788, seqnum=43
> 2018-05-23 21:54:34,787 INFO [master/ctr-e138-1518143905142-329221-01-000003:20000] assignment.AssignmentManager: Number of RegionServers=1
> 2018-05-23 21:54:34,788 INFO [master/ctr-e138-1518143905142-329221-01-000003:20000] assignment.AssignmentManager: KILL RegionServer=ctr-e138-1518143905142-329221-01-000007.hwx.site,16020,1527112194788 hosting regions but not online.
> {code}
> Later, even though a different instance on 007 registered with master:
> {code}
> 2018-05-23 21:55:13,541 INFO [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=20000] master.ServerManager: Registering regionserver=ctr-e138-1518143905142-329221-01-000007.hwx.site,16020,1527112506002
> ...
> 2018-05-23 21:55:43,881 INFO [master/ctr-e138-1518143905142-329221-01-000003:20000] client.RpcRetryingCallerImpl: Call exception, tries=12, retries=12, started=69001 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:namespace,,1527099443383.01a7f9ba9fffd691f261d3fbc620da06. is not online on ctr-e138-1518143905142-329221-01-000007.hwx.site,16020,1527112506002
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3273)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3250)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2446)
>   at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
> {code}
> There was no OPEN request for 01a7f9ba9fffd691f261d3fbc620da06 sent to that server instance.
> From hbase-hbase-regionserver-ctr-e138-1518143905142-329221-01-000007.hwx.site.log:
> {code}
> 2018-05-23 21:52:27,414 INFO [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-329221-01-000007:16020-1] regionserver.HRegion: Closed hbase:namespace,,1527099443383.01a7f9ba9fffd691f261d3fbc620da06.
> {code}
> Then region server 007 restarted:
> {code}
> Wed May 23 21:55:03 UTC 2018 Starting regionserver on ctr-e138-1518143905142-329221-01-000007.hwx.site
> {code}
> After which the region 01a7f9ba9fffd691f261d3fbc620da06 never showed up again in log 007

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)