[
https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149647#comment-17149647
]
Nick Dimiduk commented on HBASE-24526:
--------------------------------------
My test harness hit this issue again last night: the PEWorker pool was full of stuck workers.
> Deadlock executing assign meta procedure
> ----------------------------------------
>
> Key: HBASE-24526
> URL: https://issues.apache.org/jira/browse/HBASE-24526
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, Region Assignment
> Affects Versions: 2.3.0
> Reporter: Nick Dimiduk
> Priority: Critical
>
> I have what appears to be a deadlock while assigning meta. During recovery,
> the master creates the assign procedure for meta and immediately marks meta as
> assigned in ZooKeeper. It then creates the subprocedure to open meta on the
> target region server. However, the PEWorker pool is full of procedures that
> are stuck, I think because their calls to update meta are going nowhere. For
> what it's worth, the balancer is running concurrently and has calculated a
> plan of size 41.
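> To make the suspected cycle concrete, here is a toy sketch in plain Java (not
> HBase code; the pool size and pids are borrowed from the log, the class name
> is made up): a bounded worker pool in which every thread blocks waiting for a
> resource that only a still-queued task can provide will never make progress.
> {code:java}
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> public class PoolExhaustionDeadlock {
>   public static void main(String[] args) {
>     final int workers = 16;                             // mirrors the 16 PEWorker threads
>     ExecutorService pool = Executors.newFixedThreadPool(workers);
>     CountDownLatch metaOnline = new CountDownLatch(1);  // "meta is open and writable"
>
>     // Every worker picks up a region-transition task that must write to meta
>     // before it can finish, so it blocks until metaOnline is released.
>     for (int i = 0; i < workers; i++) {
>       final int pid = 17804 + i;
>       pool.submit(() -> {
>         try {
>           System.out.println("pid=" + pid + " waiting to update meta");
>           metaOnline.await();                           // stuck: meta never comes online
>         } catch (InterruptedException e) {
>           Thread.currentThread().interrupt();
>         }
>       });
>     }
>
>     // The one task that would bring meta online (the OpenRegionProcedure
>     // analogue, pid=17803) is queued behind the blocked workers and never
>     // gets a thread, so the countdown never happens.
>     pool.submit(() -> {
>       System.out.println("pid=17803 opening meta");
>       metaOnline.countDown();
>     });
>
>     pool.shutdown();  // returns immediately, but the pool is wedged and the JVM never exits
>   }
> }
> {code}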
> From the master log,
> {noformat}
> 2020-06-06 00:34:07,314 INFO
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure:
> Starting pid=17802, ppid=17801,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
> state=OPEN, location=null; forceNewPlan=true, retain=false
> 2020-06-06 00:34:07,465 INFO
> org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta
> (replicaId=0) location in ZooKeeper as
> hbasedn139.example.com,16020,1591403576247
> 2020-06-06 00:34:07,466 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized
> subprocedures=[{pid=17803, ppid=17802, state=RUNNABLE;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> {noformat}
> {{pid=17803}} is not mentioned again. hbasedn139 never receives an
> {{openRegion}} RPC.
> Meanwhile, additional procedures are scheduled and picked up by workers, and
> each one gets stuck. I see "Worker stuck" log lines for all 16 PEWorker
> threads.
> {noformat}
> 2020-06-06 00:34:07,961 INFO
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
> for pid=17804, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE;
> TransitRegionStateProcedure table=IntegrationTestBigLinkedList,
> region=54f4f6c0e921e6d25e6043cba79c09aa, REOPEN/MOVE
> 2020-06-06 00:34:07,961 INFO
> org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=17804
> updating hbase:meta row=54f4f6c0e921e6d25e6043cba79c09aa,
> regionState=CLOSING, regionLocation=hbasedn046.example.com,16020,1591402383956
> ...
> 2020-06-06 00:34:22,295 WARN
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck
> PEWorker-16(pid=17804), run time 14.3340 sec
> ...
> 2020-06-06 00:34:27,295 WARN
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck
> PEWorker-16(pid=17804), run time 19.3340 sec
> ...
> {noformat}
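> The repeating "Worker stuck" warnings above come from a periodic check that
> compares how long each worker has been on its current procedure against a
> threshold. Roughly this shape, as an illustrative sketch only (the threshold,
> names, and 5-second period are assumptions, not the actual ProcedureExecutor
> code):
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> public class WorkerStuckMonitor {
>   // Worker name -> timestamp (ms) when it started its current procedure.
>   private final Map<String, Long> startTimes = new ConcurrentHashMap<>();
>   private final long stuckThresholdMs = 10_000;  // assumed threshold, for illustration
>
>   void onProcedureStart(String worker) {
>     startTimes.put(worker, System.currentTimeMillis());
>   }
>
>   void onProcedureEnd(String worker) {
>     startTimes.remove(worker);
>   }
>
>   void startMonitor() {
>     Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
>       long now = System.currentTimeMillis();
>       startTimes.forEach((worker, start) -> {
>         long runTimeMs = now - start;
>         if (runTimeMs > stuckThresholdMs) {
>           // Same shape as the log line above: worker id plus run time.
>           System.out.printf("Worker stuck %s, run time %.4f sec%n",
>               worker, runTimeMs / 1000.0);
>         }
>       });
>     }, 5, 5, TimeUnit.SECONDS);
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     WorkerStuckMonitor monitor = new WorkerStuckMonitor();
>     monitor.startMonitor();
>     monitor.onProcedureStart("PEWorker-16(pid=17804)");
>     Thread.sleep(30_000);  // the worker never finishes, so the warning repeats
>   }
> }
> {code}
> Note that the run times in the log grow by exactly 5 seconds between warnings,
> consistent with a fixed-period check.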
> The cluster stays in this state, with the PEWorker threads stuck, for upwards
> of 15 minutes. Eventually the master starts logging:
> {noformat}
> 2020-06-06 00:50:18,033 INFO
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception,
> tries=30, retries=31, started=970072 ms ago, cancelled=false, msg=Call queue
> is full on hbasedn139.example.com,16020,1591403576247, too many items queued
> ?, details=row
> 'IntegrationTestBigLinkedList,,1591398987965.54f4f6c0e921e6d25e6043cba79c09aa.'
> on table 'hbase:meta' at region=hbase:meta,,1.
> 1588230740, hostname=hbasedn139.example.com,16020,1591403576247, seqNum=-1,
> see https://s.apache.org/timeout
> {noformat}
> The master never recovers on its own.
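> The retry trace also shows why the workers never free up on their own: each
> meta write keeps retrying against the meta location published in ZooKeeper
> (hbasedn139), and every attempt fails, so the worker spends its time sleeping
> between retries. A minimal sketch of that retry shape, with hypothetical
> helper names rather than the real RpcRetryingCallerImpl API:
> {code:java}
> public class BoundedRetrySketch {
>   /** Hypothetical stand-in for the meta mutation a PEWorker is trying to apply. */
>   interface MetaWrite {
>     void run() throws Exception;
>   }
>
>   static void callWithRetries(MetaWrite write, int maxRetries) throws Exception {
>     long started = System.currentTimeMillis();
>     for (int tries = 0; tries <= maxRetries; tries++) {
>       try {
>         write.run();  // RPC to the server ZooKeeper says hosts meta
>         return;
>       } catch (Exception e) {  // e.g. "Call queue is full" from the target server
>         long elapsed = System.currentTimeMillis() - started;
>         System.out.printf("Call exception, tries=%d, retries=%d, started=%d ms ago, msg=%s%n",
>             tries, maxRetries, elapsed, e.getMessage());
>         if (tries == maxRetries) {
>           throw e;
>         }
>         // Backoff grows with the attempt count; the worker thread sleeps here,
>         // which is why it shows up as "stuck" for the whole retry window.
>         Thread.sleep(Math.min(1000L * (tries + 1), 10_000L));
>       }
>     }
>   }
> }
> {code}
> In the log above, a single call has been retrying for roughly 16 minutes
> (started=970072 ms ago) and is only on try 30 of 31, which matches the
> 15-plus minutes the workers stay stuck.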
> I'm not sure how common this condition might be. It popped up after about 20
> total hours of running ITBLL with ServerKillingMonkey.