[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129770#comment-17129770 ]
Michael Stack commented on HBASE-24526:
---------------------------------------
These went in last night; seem related:
{code}
commit 4486a565b5cd9b9304701bc24c0f7d30cf174711
Author: Duo Zhang <[email protected]>
Date: Tue Jun 9 11:07:16 2020 +0800
HBASE-24117 Shutdown AssignmentManager before ProcedureExecutor may cause
SCP to accidentally skip assigning a region (#1865)
Signed-off-by: Michael Stack <[email protected]>
commit dd1010c15d1737d6f83497ef56e4dad09d80ac74
Author: Duo Zhang <[email protected]>
Date: Tue Jun 9 08:14:00 2020 +0800
HBASE-24517 AssignmentManager.start should add meta region to
ServerStateNode (#1866)
Signed-off-by: Viraj Jasani <[email protected]>
Signed-off-by: Wellington Ramos Chevreuil <[email protected]>
{code}
> Deadlock executing assign meta procedure
> ----------------------------------------
>
> Key: HBASE-24526
> URL: https://issues.apache.org/jira/browse/HBASE-24526
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, Region Assignment
> Affects Versions: 2.3.0
> Reporter: Nick Dimiduk
> Priority: Critical
>
> I have what appears to be a deadlock while assigning meta. During recovery,
> master creates the assign procedure for meta, and immediately marks meta as
> assigned in zookeeper. It then creates the subprocedure to open meta on the
> target region server. However, the PEWorker pool is full of procedures that are
> stuck, I think because their calls to update meta are going nowhere. For what
> it's worth, the balancer is running concurrently, and has calculated a plan
> size of 41.
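> A minimal sketch of the suspected starvation pattern (hypothetical standalone
> Java, not HBase code): every worker in a fixed-size pool blocks on a result
> that only another queued task can produce, so that task never runs.
> {code}
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> public class WorkerStarvationSketch {
>   public static void main(String[] args) throws Exception {
>     ExecutorService pool = Executors.newFixedThreadPool(16); // stands in for the 16 PEWorkers
>     CountDownLatch metaOnline = new CountDownLatch(1);       // stands in for meta being open
>
>     // Every worker picks up a region transition that must first update hbase:meta,
>     // which cannot succeed until meta is online somewhere.
>     for (int i = 0; i < 16; i++) {
>       pool.submit(() -> {
>         try {
>           metaOnline.await(); // the meta update "goes nowhere"
>         } catch (InterruptedException e) {
>           Thread.currentThread().interrupt();
>         }
>       });
>     }
>
>     // The task that would bring meta online (the OpenRegionProcedure analogue)
>     // is queued behind the blocked workers and never executes: deadlock.
>     pool.submit(metaOnline::countDown);
>
>     pool.shutdown();
>     System.out.println("terminated=" + pool.awaitTermination(5, TimeUnit.SECONDS)); // prints false
>     pool.shutdownNow();
>   }
> }
> {code}
> If any worker were free for that last task, the latch would open and everything
> would drain; the hang only requires every worker to be occupied by a procedure
> that depends on meta.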
> From the master log,
> {noformat}
> 2020-06-06 00:34:07,314 INFO
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure:
> Starting pid=17802, ppid=17801,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
> state=OPEN, location=null; forceNewPlan=true, retain=false
> 2020-06-06 00:34:07,465 INFO
> org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta
> (replicaId=0) location in ZooKeeper as
> hbasedn139.example.com,16020,1591403576247
> 2020-06-06 00:34:07,466 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized
> subprocedures=[{pid=17803, ppid=17802, state=RUNNABLE;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> {noformat}
> {{pid=17803}} is not mentioned again. hbasedn139 never receives an
> {{openRegion}} RPC.
> Meanwhile, additional procedures are scheduled and picked up by workers, each
> getting "stuck". I see log lines for all 16 PEWorker threads saying that they
> are stuck.
> {noformat}
> 2020-06-06 00:34:07,961 INFO
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
> for pid=17804, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE;
> TransitRegionStateProcedure table=IntegrationTestBigLinkedList,
> region=54f4f6c0e921e6d25e6043cba79c09aa, REOPEN/MOVE
> 2020-06-06 00:34:07,961 INFO
> org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=17804
> updating hbase:meta row=54f4f6c0e921e6d25e6043cba79c09aa,
> regionState=CLOSING, regionLocation=hbasedn046.example.com,16020,1591402383956
> ...
> 2020-06-06 00:34:22,295 WARN
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck
> PEWorker-16(pid=17804), run time 14.3340 sec
> ...
> 2020-06-06 00:34:27,295 WARN
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck
> PEWorker-16(pid=17804), run time 19.3340 sec
> ...
> {noformat}
> The cluster stays in this state, with PEWorker threads stuck, for upwards of 15
> minutes. Eventually the master starts logging
> {noformat}
> 2020-06-06 00:50:18,033 INFO
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception,
> tries=30, retries=31, started=970072 ms ago, cancelled=false, msg=Call queue
> is full on hbasedn139.example.com,16020,1591403576247, too many items queued
> ?, details=row
> 'IntegrationTestBigLinkedList,,1591398987965.54f4f6c0e921e6d25e6043cba79c09aa.'
> on table 'hbase:meta' at region=hbase:meta,,1.1588230740,
> hostname=hbasedn139.example.com,16020,1591403576247, seqNum=-1,
> see https://s.apache.org/timeout
> {noformat}
> The master never recovers on its own.
> I'm not sure how common this condition might be. This popped after about 20
> total hours of running ITBLL with ServerKillingMonkey.