[jira] [Commented] (HBASE-20680) Master hung during initialization waiting on hbase:meta to be assigned which never does

Ted Yu (JIRA) Mon, 04 Jun 2018 08:42:32 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500394#comment-16500394
 ]


Ted Yu commented on HBASE-20680:
--------------------------------

>From regionserver-ctr-e138-1518143905142-340983-03-000014.hwx.site.log which 
>restarted at Fri Jun  1 10:49:46 :
{code}
2018-06-01 10:49:52,588 DEBUG 
[RpcServer.priority.FPBQ.Fifo.handler=0,queue=0,port=16020] ipc.RpcServer: 
callId: 168 service: AdminService methodName: GetRegionInfo size: 47 
connection: 172.27.86.210:47868 deadline: 1527850282584, 
exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is 
not online on 
ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527850188745
2018-06-01 10:49:52,693 DEBUG 
[RpcServer.priority.FPBQ.Fifo.handler=0,queue=0,port=16020] ipc.RpcServer: 
callId: 169 service: AdminService methodName: GetRegionInfo size: 47 
connection: 172.27.86.210:47868 deadline: 1527850282693, 
exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is 
not online on 
ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527850188745
{code}
Such log repeated until:
{code}
2018-06-01 16:24:45,308 DEBUG 
[RpcServer.priority.FPBQ.Fifo.handler=0,queue=0,port=16020] ipc.RpcServer: 
callId: 197326 service: AdminService methodName: GetRegionInfo size: 48         
 connection: 172.27.86.210:47868 deadline: 1527870375308, 
exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is 
not online on ctr-e138-1518143905142-340983-03-    
000014.hwx.site,16020,1527850188745
{code}

> Master hung during initialization waiting on hbase:meta to be assigned which 
> never does
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-20680
>                 URL: https://issues.apache.org/jira/browse/HBASE-20680
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Priority: Critical
>         Attachments: 20680-logs.tar.gz
>
>
> When running IntegrationTestRSGroups, the test became hung waiting on the 
> master to be initialized.
> The hbase cluster was launched without RSGroup config. The test script adds 
> required RSGroup configs to hbase-site.xml and restarts the cluster.
> It seems that, at one point while the master was trying to assign meta, the 
> destination regionserver was in the middle of going down. This has now left 
> HBase in a state where it starts the regionserver recovery procedures, but 
> never actually gets hbase:meta assigned.
> {code}
> 2018-06-01 10:47:50,024 INFO  [PEWorker-5] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=41, ppid=40, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740}]
> 2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore: hsync completed for 
> hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/MasterProcWALs/pv2-00000000000000000002.log
> 2018-06-01 10:47:50,026 INFO  [PEWorker-3] 
> procedure.MasterProcedureScheduler: pid=41, ppid=40, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,     
>             region=1588230740 checking lock on 1588230740
> 2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting 
> location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 
> for rit=OFFLINE, location=ctr-  
> e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190, 
> table=hbase:meta, region=1588230740 last loc=null
> 2018-06-01 10:47:50,026 INFO  [PEWorker-3] assignment.AssignProcedure: 
> Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta,                region=1588230740; 
> rit=OFFLINE, 
> location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190;
>  forceNewPlan=false, retain=true target svr=null
> {code}
> At Fri Jun  1 10:48:04, master was restarted.
> The new master picked up pid=41:
> {code}
> 2018-06-01 10:48:47,971 INFO  [PEWorker-1] assignment.AssignProcedure: 
> Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta,                region=1588230740; 
> rit=OFFLINE, location=null; forceNewPlan=false, retain=false target svr=null
> {code}
> There was no further log for pid=41 after above.
> Later when master initiated another meta recovery procedure (pid=42), the 
> second procedure seems to be locked out by the former:
> {code}
> 2018-06-01 10:49:34,292 INFO  [PEWorker-2] 
> procedure.MasterProcedureScheduler: pid=43, ppid=42, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,     
>             region=1588230740, 
> target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 
> checking lock on 1588230740
> 2018-06-01 10:49:34,293 DEBUG [PEWorker-2] 
> assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={}, 
> namespaceLocks={}, tableLocks={},                                 
> regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0, 
> waitingProcCount=1}}, peerLocks={}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-20680) Master hung during initialization waiting on hbase:meta to be assigned which never does

Reply via email to