[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129770#comment-17129770 ]
Michael Stack commented on HBASE-24526:
---------------------------------------
These went in last night; seem related:
{code}
commit 4486a565b5cd9b9304701bc24c0f7d30cf174711
Author: Duo Zhang <[email protected]>
Date: Tue Jun 9 11:07:16 2020 +0800
HBASE-24117 Shutdown AssignmentManager before ProcedureExecutor may cause
SCP to accidentally skip assigning a region (#1865)
Signed-off-by: Michael Stack <[email protected]>
commit dd1010c15d1737d6f83497ef56e4dad09d80ac74
Author: Duo Zhang <[email protected]>
Date: Tue Jun 9 08:14:00 2020 +0800
HBASE-24517 AssignmentManager.start should add meta region to
ServerStateNode (#1866)
Signed-off-by: Viraj Jasani <[email protected]>
Signed-off-by: Wellington Ramos Chevreuil <[email protected]>
{code}
> Deadlock executing assign meta procedure
> ----------------------------------------
>
> Key: HBASE-24526
> URL: https://issues.apache.org/jira/browse/HBASE-24526
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, Region Assignment
> Affects Versions: 2.3.0
> Reporter: Nick Dimiduk
> Priority: Critical
>
> I have what appears to be a deadlock while assigning meta. During recovery,
> master creates the assign procedure for meta, and immediately marks meta as
> assigned in zookeeper. It then creates the subprocedure to open meta on the
> target region server. However, the PEWorker pool is full of procedures that are
> stuck, I think because their calls to update meta are going nowhere. For what
> it's worth, the balancer is running concurrently, and has calculated a plan
> size of 41.
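> A minimal sketch of the suspected starvation pattern (hypothetical standalone
> Java, not HBase code): every worker in a fixed-size pool blocks on a result
> that only another queued task can produce, so that task never runs.
> {code}
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> public class WorkerStarvationSketch {
>   public static void main(String[] args) throws Exception {
>     ExecutorService pool = Executors.newFixedThreadPool(16); // stands in for the 16 PEWorkers
>     CountDownLatch metaOnline = new CountDownLatch(1);       // stands in for meta being open
>
>     // Every worker picks up a region transition that must first update hbase:meta,
>     // which cannot succeed until meta is online somewhere.
>     for (int i = 0; i < 16; i++) {
>       pool.submit(() -> {
>         try {
>           metaOnline.await(); // the meta update "goes nowhere"
>         } catch (InterruptedException e) {
>           Thread.currentThread().interrupt();
>         }
>       });
>     }
>
>     // The task that would bring meta online (the OpenRegionProcedure analogue)
>     // is queued behind the blocked workers and never executes: deadlock.
>     pool.submit(metaOnline::countDown);
>
>     pool.shutdown();
>     System.out.println("terminated=" + pool.awaitTermination(5, TimeUnit.SECONDS)); // prints false
>     pool.shutdownNow();
>   }
> }
> {code}
> If any worker were free for that last task, the latch would open and everything
> would drain; the hang only requires every worker to be occupied by a procedure
> that depends on meta.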
> From the master log,
> {noformat}
> 2020-06-06 00:34:07,314 INFO
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure:
> Starting pid=17802, ppid=17801,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
> state=OPEN, location=null; forceNewPlan=true, retain=false
> 2020-06-06 00:34:07,465 INFO
> org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta
> (replicaId=0) location in ZooKeeper as
> hbasedn139.example.com,16020,1591403576247
> 2020-06-06 00:34:07,466 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized
> subprocedures=[{pid=17803, ppid=17802, state=RUNNABLE;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> {noformat}
> {{pid=17803}} is not mentioned again. hbasedn139 never receives an
> {{openRegion}} RPC.
> Meanwhile, additional procedures are scheduled and picked up by workers, each
> getting "stuck". I see log lines for all 16 PEWorker threads saying that they
> are stuck.
> {noformat}
> 2020-06-06 00:34:07,961 INFO
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
> for pid=17804, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE;
> TransitRegionStateProcedure table=IntegrationTestBigLinkedList,
> region=54f4f6c0e921e6d25e6043cba79c09aa, REOPEN/MOVE
> 2020-06-06 00:34:07,961 INFO
> org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=17804
> updating hbase:meta row=54f4f6c0e921e6d25e6043cba79c09aa,
> regionState=CLOSING, regionLocation=hbasedn046.example.com,16020,1591402383956
> ...
> 2020-06-06 00:34:22,295 WARN
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck
> PEWorker-16(pid=17804), run time 14.3340 sec
> ...
> 2020-06-06 00:34:27,295 WARN
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck
> PEWorker-16(pid=17804), run time 19.3340 sec
> ...
> {noformat}
> The cluster stays in this state, with PEWorker threads stuck, for upwards of 15
> minutes. Eventually the master starts logging
> {noformat}
> 2020-06-06 00:50:18,033 INFO
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception,
> tries=30, retries=31, started=970072 ms ago, cancelled=false, msg=Call queue
> is full on hbasedn139.example.com,16020,1591403576247, too many items queued
> ?, details=row
> 'IntegrationTestBigLinkedList,,1591398987965.54f4f6c0e921e6d25e6043cba79c09aa.'
> on table 'hbase:meta' at region=hbase:meta,,1.1588230740,
> hostname=hbasedn139.example.com,16020,1591403576247, seqNum=-1,
> see https://s.apache.org/timeout
> {noformat}
> The master never recovers on its own.
> I'm not sure how common this condition might be. This popped after about 20
> total hours of running ITBLL with ServerKillingMonkey.