Ted Yu created HBASE-20680:
------------------------------
Summary: Master hung during initialization waiting on hbase:meta
to be assigned which never does
Key: HBASE-20680
URL: https://issues.apache.org/jira/browse/HBASE-20680
Project: HBase
Issue Type: Bug
Reporter: Josh Elser
When running IntegrationTestRSGroups, the test became hung waiting on the
master to be initialized.
The hbase cluster was launched without RSGroup config. The test script adds
required RSGroup configs to hbase-site.xml and restarts the cluster.
It seems that, at one point while the master was trying to assign meta, the
destination regionserver was in the middle of going down. This has now left
HBase in a state where it starts the regionserver recovery procedures, but
never actually gets hbase:meta assigned.
{code}
2018-06-01 10:47:50,024 INFO [PEWorker-5] procedure2.ProcedureExecutor:
Initialized subprocedures=[{pid=41, ppid=40,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,
region=1588230740}]
2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread]
wal.WALProcedureStore: hsync completed for
hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/
MasterProcWALs/pv2-00000000000000000002.log
2018-06-01 10:47:50,026 INFO [PEWorker-3] procedure.MasterProcedureScheduler:
pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:meta, region=1588230740 checking lock on 1588230740
2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting
location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190
for rit=OFFLINE, location=ctr-
e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190,
table=hbase:meta, region=1588230740 last loc=null
2018-06-01 10:47:50,026 INFO [PEWorker-3] assignment.AssignProcedure: Starting
pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:meta, region=1588230740; rit=OFFLINE,
location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190;
forceNewPlan=false, retain=true target svr=null
{code}
At Fri Jun 1 10:48:04, master was restarted.
The new master picked up pid=41:
{code}
2018-06-01 10:48:47,971 INFO [PEWorker-1] assignment.AssignProcedure: Starting
pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:meta, region=1588230740; rit=OFFLINE, location=null;
forceNewPlan=false, retain=false target svr=null
{code}
There was no further log for pid=41 after above.
Later when master initiated another meta recovery procedure (pid=42), the
second procedure seems to be locked out by the former:
{code}
2018-06-01 10:49:34,292 INFO [PEWorker-2] procedure.MasterProcedureScheduler:
pid=43, ppid=42, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
table=hbase:meta, region=1588230740,
target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190
checking lock on 1588230740
2018-06-01 10:49:34,293 DEBUG [PEWorker-2]
assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={},
namespaceLocks={}, tableLocks={},
regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0,
waitingProcCount=1}}, peerLocks={}
{code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)