[jira] [Updated] (HBASE-20680) Master hung during initialization waiting on hbase:meta to be assigned which never does

Ted Yu (JIRA) Mon, 04 Jun 2018 07:53:08 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ted Yu updated HBASE-20680:
---------------------------
    Description: 
When running IntegrationTestRSGroups, the test became hung waiting on the 
master to be initialized.

The hbase cluster was launched without RSGroup config. The test script adds 
required RSGroup configs to hbase-site.xml and restarts the cluster.

It seems that, at one point while the master was trying to assign meta, the 
destination regionserver was in the middle of going down. This has now left 
HBase in a state where it starts the regionserver recovery procedures, but 
never actually gets hbase:meta assigned.

{code}
2018-06-01 10:47:50,024 INFO  [PEWorker-5] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=41, ppid=40, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
region=1588230740}]

2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: hsync completed for 
hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/MasterProcWALs/pv2-00000000000000000002.log

2018-06-01 10:47:50,026 INFO  [PEWorker-3] procedure.MasterProcedureScheduler: 
pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta,                 region=1588230740 checking lock on 1588230740

2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting 
location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 
for rit=OFFLINE, location=ctr-  
e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190, 
table=hbase:meta, region=1588230740 last loc=null

2018-06-01 10:47:50,026 INFO  [PEWorker-3] assignment.AssignProcedure: Starting 
pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta,                region=1588230740; rit=OFFLINE, 
location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190; 
forceNewPlan=false, retain=true target svr=null
{code}

At Fri Jun  1 10:48:04, master was restarted.

The new master picked up pid=41:

{code}
2018-06-01 10:48:47,971 INFO  [PEWorker-1] assignment.AssignProcedure: Starting 
pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta,                region=1588230740; rit=OFFLINE, location=null; 
forceNewPlan=false, retain=false target svr=null
{code}

There was no further log for pid=41 after above.

Later when master initiated another meta recovery procedure (pid=42), the 
second procedure seems to be locked out by the former:

{code}
2018-06-01 10:49:34,292 INFO  [PEWorker-2] procedure.MasterProcedureScheduler: 
pid=43, ppid=42, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta,                 region=1588230740, 
target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 
checking lock on 1588230740

2018-06-01 10:49:34,293 DEBUG [PEWorker-2] 
assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={}, 
namespaceLocks={}, tableLocks={},                                 
regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0, 
waitingProcCount=1}}, peerLocks={}
{code}

  was:
When running IntegrationTestRSGroups, the test became hung waiting on the 
master to be initialized.

The hbase cluster was launched without RSGroup config. The test script adds 
required RSGroup configs to hbase-site.xml and restarts the cluster.
It seems that, at one point while the master was trying to assign meta, the 
destination regionserver was in the middle of going down. This has now left 
HBase in a state where it starts the regionserver recovery procedures, but 
never actually gets hbase:meta assigned.

{code}

2018-06-01 10:47:50,024 INFO  [PEWorker-5] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=41, ppid=40, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
region=1588230740}]

2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: hsync completed for 
hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/   
   MasterProcWALs/pv2-00000000000000000002.log

2018-06-01 10:47:50,026 INFO  [PEWorker-3] procedure.MasterProcedureScheduler: 
pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta,                 region=1588230740 checking lock on 1588230740

2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting 
location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 
for rit=OFFLINE, location=ctr-  
e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190, 
table=hbase:meta, region=1588230740 last loc=null

2018-06-01 10:47:50,026 INFO  [PEWorker-3] assignment.AssignProcedure: Starting 
pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta,                region=1588230740; rit=OFFLINE, 
location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190; 
forceNewPlan=false, retain=true target svr=null

{code}

At Fri Jun  1 10:48:04, master was restarted.

The new master picked up pid=41:

{code}

2018-06-01 10:48:47,971 INFO  [PEWorker-1] assignment.AssignProcedure: Starting 
pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta,                region=1588230740; rit=OFFLINE, location=null; 
forceNewPlan=false, retain=false target svr=null
{code}

There was no further log for pid=41 after above.

Later when master initiated another meta recovery procedure (pid=42), the 
second procedure seems to be locked out by the former:

{code}
2018-06-01 10:49:34,292 INFO  [PEWorker-2] procedure.MasterProcedureScheduler: 
pid=43, ppid=42, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=hbase:meta,                 region=1588230740, 
target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 
checking lock on 1588230740

2018-06-01 10:49:34,293 DEBUG [PEWorker-2] 
assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={}, 
namespaceLocks={}, tableLocks={},                                 
regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0, 
waitingProcCount=1}}, peerLocks={}
{code}


> Master hung during initialization waiting on hbase:meta to be assigned which 
> never does
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-20680
>                 URL: https://issues.apache.org/jira/browse/HBASE-20680
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Priority: Critical
>
> When running IntegrationTestRSGroups, the test became hung waiting on the 
> master to be initialized.
> The hbase cluster was launched without RSGroup config. The test script adds 
> required RSGroup configs to hbase-site.xml and restarts the cluster.
> It seems that, at one point while the master was trying to assign meta, the 
> destination regionserver was in the middle of going down. This has now left 
> HBase in a state where it starts the regionserver recovery procedures, but 
> never actually gets hbase:meta assigned.
> {code}
> 2018-06-01 10:47:50,024 INFO  [PEWorker-5] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=41, ppid=40, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, 
> region=1588230740}]
> 2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore: hsync completed for 
> hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/MasterProcWALs/pv2-00000000000000000002.log
> 2018-06-01 10:47:50,026 INFO  [PEWorker-3] 
> procedure.MasterProcedureScheduler: pid=41, ppid=40, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,     
>             region=1588230740 checking lock on 1588230740
> 2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting 
> location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 
> for rit=OFFLINE, location=ctr-  
> e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190, 
> table=hbase:meta, region=1588230740 last loc=null
> 2018-06-01 10:47:50,026 INFO  [PEWorker-3] assignment.AssignProcedure: 
> Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta,                region=1588230740; 
> rit=OFFLINE, 
> location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190;
>  forceNewPlan=false, retain=true target svr=null
> {code}
> At Fri Jun  1 10:48:04, master was restarted.
> The new master picked up pid=41:
> {code}
> 2018-06-01 10:48:47,971 INFO  [PEWorker-1] assignment.AssignProcedure: 
> Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=hbase:meta,                region=1588230740; 
> rit=OFFLINE, location=null; forceNewPlan=false, retain=false target svr=null
> {code}
> There was no further log for pid=41 after above.
> Later when master initiated another meta recovery procedure (pid=42), the 
> second procedure seems to be locked out by the former:
> {code}
> 2018-06-01 10:49:34,292 INFO  [PEWorker-2] 
> procedure.MasterProcedureScheduler: pid=43, ppid=42, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,     
>             region=1588230740, 
> target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 
> checking lock on 1588230740
> 2018-06-01 10:49:34,293 DEBUG [PEWorker-2] 
> assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={}, 
> namespaceLocks={}, tableLocks={},                                 
> regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0, 
> waitingProcCount=1}}, peerLocks={}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-20680) Master hung during initialization waiting on hbase:meta to be assigned which never does

Reply via email to