[ 
https://issues.apache.org/jira/browse/HBASE-20644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491093#comment-16491093
 ] 

Ted Yu commented on HBASE-20644:
--------------------------------

101383-master-ctr-e138-1518143905142-329221-01-000004.hwx.site.log was active 
master log collected today.

AM requeue'd regions since former servers were offline (there was cluster 
restart):
{code}
2018-05-25 15:41:51,543 INFO  [PEWorker-5] assignment.AssignProcedure: Server 
not online, re-queuing pid=974, ppid=957, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure        
table=SYSTEM.FUNCTION, region=122fca58c6ac8626bd29868348b8d7d7; rit=OPENING, 
location=ctr-e138-1518143905142-329221-01-000002.hwx.site,16020,1527261630657

The region was assigned to server 006:

2018-05-25 15:41:51,769 INFO  [PEWorker-2] assignment.RegionStateStore: pid=974 
updating hbase:meta row=122fca58c6ac8626bd29868348b8d7d7, regionState=OPENING, 
regionLocation=ctr-e138-   
1518143905142-329221-01-000006.hwx.site,16020,1527262866049
{code}
3 Phoenix table regions were still stuck in transition :
{code}
2018-05-25 15:42:56,546 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition rit=OPENING, 
location=ctr-e138-1518143905142-329221-01-000006.hwx.site,16020,    
1527262866049, table=SYSTEM.FUNCTION, region=122fca58c6ac8626bd29868348b8d7d7
2018-05-25 15:42:56,546 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition rit=OPENING, 
location=ctr-e138-1518143905142-329221-01-000006.hwx.site,16020,    
1527262866049, table=GRAMMAR_TABLE_INDEX, 
region=577e8deb885052ec45b8727603bd5cf9
2018-05-25 15:42:56,546 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition rit=OPENING, 
location=ctr-e138-1518143905142-329221-01-000006.hwx.site,16020,    
1527262866049, table=SYSTEM.CATALOG, region=8c8a2fa7b3de53bdb1a8c2066deac5ac
{code}
Found the following in master log (might be related to HBASE-20492):
{code}
2018-05-25 17:26:27,441 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
procedure2.ProcedureExecutor: Stored pid=1007, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH;       UnassignProcedure 
table=GRAMMAR_TABLE_INDEX, region=577e8deb885052ec45b8727603bd5cf9, 
server=ctr-e138-1518143905142-329221-01-000006.hwx.site,16020,1527262866049
2018-05-25 17:26:27,443 INFO  [PEWorker-15] procedure.MasterProcedureScheduler: 
pid=1007, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=GRAMMAR_TABLE_INDEX,         region=577e8deb885052ec45b8727603bd5cf9, 
server=ctr-e138-1518143905142-329221-01-000006.hwx.site,16020,1527262866049 
checking lock on 577e8deb885052ec45b8727603bd5cf9
2018-05-25 17:26:27,445 WARN  [PEWorker-15] 
assignment.RegionTransitionProcedure: Failed transition, suspend 1secs 
pid=1007, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=GRAMMAR_TABLE_INDEX, region=577e8deb885052ec45b8727603bd5cf9, 
server=ctr-e138-1518143905142-329221-01-000006.hwx.site,16020,1527262866049; 
rit=OPENING, location=ctr-e138-          
1518143905142-329221-01-000006.hwx.site,16020,1527262866049; waiting on 
rectified condition fixed by other Procedure or operator intervention
org.apache.hadoop.hbase.exceptions.UnexpectedStateException: Expected 
[SPLITTING, SPLIT, MERGING, OPEN, CLOSING] so could move to CLOSING but current 
state=OPENING
  at 
org.apache.hadoop.hbase.master.assignment.RegionStates$RegionStateNode.transitionState(RegionStates.java:158)
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1548)
  at 
org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:203)
  at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:349)
  at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:94)
  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
  at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1472)
  at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1240)
  at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
{code}


> Master shutdown due to service ClusterSchemaServiceImpl failing to start
> ------------------------------------------------------------------------
>
>                 Key: HBASE-20644
>                 URL: https://issues.apache.org/jira/browse/HBASE-20644
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Romil Choksi
>            Priority: Critical
>         Attachments: 
> 101383-master-ctr-e138-1518143905142-329221-01-000003.hwx.site.log, 
> 101383-master-ctr-e138-1518143905142-329221-01-000004.hwx.site.log, 
> 101383-regionserver-ctr-e138-1518143905142-329221-01-000002.hwx.site.log, 
> 101383-regionserver-ctr-e138-1518143905142-329221-01-000006.hwx.site.log, 
> 101383-regionserver-ctr-e138-1518143905142-329221-01-000007.hwx.site.log
>
>
> From hbase-hbase-master-ctr-e138-1518143905142-329221-01-000003.hwx.site.log :
> {code}
> 2018-05-23 22:14:29,750 ERROR 
> [master/ctr-e138-1518143905142-329221-01-000003:20000] master.HMaster: Failed 
> to become active master
> java.lang.IllegalStateException: Expected the service 
> ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service has FAILED
>         at 
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345)
>         at 
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291)
>         at 
> org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1054)
>         at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:918)
>         at 
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2023)
> {code}
> Earlier in the log , the namespace region, 01a7f9ba9fffd691f261d3fbc620da06 , 
> was deemed OPEN on 01-000007.hwx.site,16020,1527112194788 which was declared 
> not online:
> {code}
> 2018-05-23 21:54:34,786 INFO  
> [master/ctr-e138-1518143905142-329221-01-000003:20000] 
> assignment.RegionStateStore: Load hbase:meta entry                            
>              region=01a7f9ba9fffd691f261d3fbc620da06, regionState=OPEN, 
> lastHost=ctr-e138-1518143905142-329221-01-000007.hwx.site,16020,1527112194788,
>  
> regionLocation=ctr-e138-1518143905142-329221-01-000007.hwx.site,16020,1527112194788,
>  seqnum=43
> 2018-05-23 21:54:34,787 INFO  
> [master/ctr-e138-1518143905142-329221-01-000003:20000] 
> assignment.AssignmentManager: Number of RegionServers=1
> 2018-05-23 21:54:34,788 INFO  
> [master/ctr-e138-1518143905142-329221-01-000003:20000] 
> assignment.AssignmentManager: KILL 
> RegionServer=ctr-e138-1518143905142-329221-01-000007.   
> hwx.site,16020,1527112194788 hosting regions but not online.
> {code}
> Later, even though a different instance on 007 registered with master:
> {code}
> 2018-05-23 21:55:13,541 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=20000] 
> master.ServerManager: Registering 
> regionserver=ctr-e138-1518143905142-329221-01-000007.hwx.site,16020,1527112506002
> ...
> 2018-05-23 21:55:43,881 INFO  
> [master/ctr-e138-1518143905142-329221-01-000003:20000] 
> client.RpcRetryingCallerImpl: Call exception, tries=12, retries=12, 
> started=69001 ms ago,            cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: 
> hbase:namespace,,1527099443383.01a7f9ba9fffd691f261d3fbc620da06. is not 
> online on ctr-e138-1518143905142-329221-  
> 01-000007.hwx.site,16020,1527112506002
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3273)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3250)
>   at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
>   at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2446)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
> {code}
> There was no OPEN request for 01a7f9ba9fffd691f261d3fbc620da06 sent to that 
> server instance.
> From 
> hbase-hbase-regionserver-ctr-e138-1518143905142-329221-01-000007.hwx.site.log 
> :
> {code}
> 2018-05-23 21:52:27,414 INFO  
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-329221-01-000007:16020-1]
>  regionserver.HRegion: Closed hbase:namespace,,1527099443383.       
> 01a7f9ba9fffd691f261d3fbc620da06.
> {code}
> Then region server 007 restarted:
> {code}
> Wed May 23 21:55:03 UTC 2018 Starting regionserver on 
> ctr-e138-1518143905142-329221-01-000007.hwx.site
> {code}
> After which the region 01a7f9ba9fffd691f261d3fbc620da06 never showed up again 
> in log 007



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to