[
https://issues.apache.org/jira/browse/HBASE-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500394#comment-16500394
]
Ted Yu commented on HBASE-20680:
--------------------------------
>From regionserver-ctr-e138-1518143905142-340983-03-000014.hwx.site.log which
>restarted at Fri Jun 1 10:49:46 :
{code}
2018-06-01 10:49:52,588 DEBUG
[RpcServer.priority.FPBQ.Fifo.handler=0,queue=0,port=16020] ipc.RpcServer:
callId: 168 service: AdminService methodName: GetRegionInfo size: 47
connection: 172.27.86.210:47868 deadline: 1527850282584,
exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is
not online on
ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527850188745
2018-06-01 10:49:52,693 DEBUG
[RpcServer.priority.FPBQ.Fifo.handler=0,queue=0,port=16020] ipc.RpcServer:
callId: 169 service: AdminService methodName: GetRegionInfo size: 47
connection: 172.27.86.210:47868 deadline: 1527850282693,
exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is
not online on
ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527850188745
{code}
Such log repeated until:
{code}
2018-06-01 16:24:45,308 DEBUG
[RpcServer.priority.FPBQ.Fifo.handler=0,queue=0,port=16020] ipc.RpcServer:
callId: 197326 service: AdminService methodName: GetRegionInfo size: 48
connection: 172.27.86.210:47868 deadline: 1527870375308,
exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is
not online on ctr-e138-1518143905142-340983-03-
000014.hwx.site,16020,1527850188745
{code}
> Master hung during initialization waiting on hbase:meta to be assigned which
> never does
> ---------------------------------------------------------------------------------------
>
> Key: HBASE-20680
> URL: https://issues.apache.org/jira/browse/HBASE-20680
> Project: HBase
> Issue Type: Bug
> Reporter: Josh Elser
> Priority: Critical
> Attachments: 20680-logs.tar.gz
>
>
> When running IntegrationTestRSGroups, the test became hung waiting on the
> master to be initialized.
> The hbase cluster was launched without RSGroup config. The test script adds
> required RSGroup configs to hbase-site.xml and restarts the cluster.
> It seems that, at one point while the master was trying to assign meta, the
> destination regionserver was in the middle of going down. This has now left
> HBase in a state where it starts the regionserver recovery procedures, but
> never actually gets hbase:meta assigned.
> {code}
> 2018-06-01 10:47:50,024 INFO [PEWorker-5] procedure2.ProcedureExecutor:
> Initialized subprocedures=[{pid=41, ppid=40,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,
> region=1588230740}]
> 2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread]
> wal.WALProcedureStore: hsync completed for
> hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/MasterProcWALs/pv2-00000000000000000002.log
> 2018-06-01 10:47:50,026 INFO [PEWorker-3]
> procedure.MasterProcedureScheduler: pid=41, ppid=40,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,
> region=1588230740 checking lock on 1588230740
> 2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting
> location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190
> for rit=OFFLINE, location=ctr-
> e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190,
> table=hbase:meta, region=1588230740 last loc=null
> 2018-06-01 10:47:50,026 INFO [PEWorker-3] assignment.AssignProcedure:
> Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=hbase:meta, region=1588230740;
> rit=OFFLINE,
> location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190;
> forceNewPlan=false, retain=true target svr=null
> {code}
> At Fri Jun 1 10:48:04, master was restarted.
> The new master picked up pid=41:
> {code}
> 2018-06-01 10:48:47,971 INFO [PEWorker-1] assignment.AssignProcedure:
> Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=hbase:meta, region=1588230740;
> rit=OFFLINE, location=null; forceNewPlan=false, retain=false target svr=null
> {code}
> There was no further log for pid=41 after above.
> Later when master initiated another meta recovery procedure (pid=42), the
> second procedure seems to be locked out by the former:
> {code}
> 2018-06-01 10:49:34,292 INFO [PEWorker-2]
> procedure.MasterProcedureScheduler: pid=43, ppid=42,
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,
> region=1588230740,
> target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190
> checking lock on 1588230740
> 2018-06-01 10:49:34,293 DEBUG [PEWorker-2]
> assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={},
> namespaceLocks={}, tableLocks={},
> regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0,
> waitingProcCount=1}}, peerLocks={}
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)