Michael Stack created HBASE-23593:
-------------------------------------
Summary: Stalled SCP Assigns
Key: HBASE-23593
URL: https://issues.apache.org/jira/browse/HBASE-23593
Project: HBase
Issue Type: Bug
Components: proc-v2
Affects Versions: 2.2.3
Reporter: Michael Stack
I'm stuck on this one so doing a write-up here in case anyone else has ideas.
Heavily loaded cluster. Server crashes. SCP cuts in and usually there's no problem, but from time to time I'll see the SCP stuck waiting on an Assign to finish. The assign seems stuck at the queuing of the OpenRegionProcedure: we've stored the procedure but then not a peep from it thereafter. Later we'll see a complaint that the region is STUCK. Doesn't recover. Doesn't run.
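For anyone not steeped in proc-v2, the shape of the chain is roughly as in the toy sketch below (plain Java, invented names, not our actual classes): the SCP spawns an assign per region, the assign spawns an OpenRegionProcedure child, and each child gets persisted and then handed to the scheduler to run. In the bad case we get as far as the persist and then nothing.
{code}
// Toy illustration of the chain only -- not HBase code; names invented.
import java.util.List;

abstract class ToyProc {
  // Returns any child procedures that should be persisted and then scheduled.
  abstract List<ToyProc> execute();
}

class ToyOpenRegionProc extends ToyProc {
  @Override List<ToyProc> execute() {
    // Would dispatch the open-region call to the chosen regionserver.
    return List.of();
  }
}

class ToyAssignProc extends ToyProc {
  @Override List<ToyProc> execute() {
    // Picks a target server, then hands the actual open off to a child.
    return List.of(new ToyOpenRegionProc());
  }
}

class ToyServerCrashProc extends ToyProc {
  @Override List<ToyProc> execute() {
    // Splits the dead server's WALs, then spawns an assign per region.
    return List.of(new ToyAssignProc());
  }
}

public class ToyChain {
  public static void main(String[] args) {
    // Walk the chain: SCP -> assign -> open-region.
    List<ToyProc> next = List.of(new ToyServerCrashProc());
    while (!next.isEmpty()) {
      ToyProc p = next.get(0);
      System.out.println("executing " + p.getClass().getSimpleName());
      next = p.execute();
    }
  }
}
{code}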
Basic story is as follows:
Server dies:
{code}
2019-12-17 11:10:42,002 INFO
org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node
deleted, processing expiration [s011.example.org,16020,1576561318119]
2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: Added
s011.example.org,16020,1576561318119; numProcessing=1
...
2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer:
Started processing s011.example.org,16020,1576561318119; numProcessing=1
{code}
The dead server restarts, which purges the old server instance from the dead-server and processing lists:
{code}
2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer:
Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager:
STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it
from the dead servers list
{code}
....even though we are still processing logs in the SCP of the old server...
{code}
2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil:
Archived processed log
hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
to hdfs://nameservice1/hbase/oldWALs/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
{code}
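To make the ordering concrete, here is a toy sketch of what those DeadServer lines amount to (invented names, nothing like the real DeadServer/ServerManager code): crash handling bumps a processing count for the dead instance, and when a new instance of the same host/port checks in, the old ServerName is dropped and the count goes back to 0, even though the crash processing for the old instance can still be mid-flight, as the WAL-splitting lines above show.
{code}
// Toy sketch only -- invented names, not the real DeadServer/ServerManager code.
import java.util.HashSet;
import java.util.Set;

public class ToyDeadServers {
  private final Set<String> dead = new HashSet<>();  // "host,port,startcode"
  private int numProcessing = 0;

  // The server's ephemeral node went away: mark it dead, start crash processing.
  synchronized void expired(String serverName) {
    dead.add(serverName);
    numProcessing++;
    System.out.println("Added " + serverName + "; numProcessing=" + numProcessing);
  }

  // A new instance of the same host/port registered: purge the old startcode.
  synchronized void cameBackUp(String oldServerName) {
    dead.remove(oldServerName);
    numProcessing--;  // drops back to 0 here...
    System.out.println("Removed " + oldServerName + "; numProcessing=" + numProcessing);
    // ...even though the crash procedure for oldServerName may still be
    // splitting its WALs and assigning its regions, as in the logs above.
  }

  public static void main(String[] args) {
    ToyDeadServers ds = new ToyDeadServers();
    ds.expired("s011.example.org,16020,1576561318119");
    ds.cameBackUp("s011.example.org,16020,1576561318119");
  }
}
{code}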
I thought the early purge of the dead server was the problem, but after study I don't think so.
The WAL split took over two minutes, and the old server had already been removed from the dead servers list more than two minutes before the split finished (removed at 11:10:58, split done at 11:13:05)...
{code}
2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager:
Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 log
files in
[hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting]
in 143236ms
{code}
Almost immediately we get this:
{code}
2019-12-17 11:14:08,649 WARN
org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
Region-In-Transition state=OPEN, location=s011.example.org,16020,1576561318119,
table=t1, region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
{code}
For this region's assign, I see the SCP making an assign for the region, which in turn creates an OpenRegionProcedure subtask. This is where it gets stuck: no progress after this, and the procedure never comes alive to run.
Here are the logs for the ORP, pid=421761:
{code}
2019-12-17 11:38:34,761 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized
subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
2019-12-17 11:38:34,765 DEBUG
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add
TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: the
exclusive lock is not held by anyone when adding pid=421761, ppid=402475,
state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
2019-12-17 11:38:34,770 DEBUG
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure
pid=421761, ppid=402475, state=RUNNABLE, locked=true;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th
rollback step
{code}
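And that is the last we hear of pid=421761. The two scheduler lines above are the "persist the child, then add its table queue to the run queue" half of the handoff; what never shows up afterwards is any sign of a worker polling the ORP back out and executing it. A toy sketch of that handoff for orientation (invented names, not the MasterProcedureScheduler):
{code}
// Toy sketch of the scheduler handoff -- invented names, not MasterProcedureScheduler.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class ToyScheduler {
  private final Map<String, Queue<Long>> tableQueues = new HashMap<>();
  private final Queue<String> runQueue = new ArrayDeque<>();  // tables with runnable procs

  // The half we DO see in the logs: the child pid lands in its table queue and
  // the table queue is put on the run queue because no exclusive lock is held.
  synchronized void add(String table, long pid) {
    tableQueues.computeIfAbsent(table, t -> new ArrayDeque<>()).add(pid);
    runQueue.add(table);
    System.out.println("Add TableQueue(" + table + ") to run queue for pid=" + pid);
  }

  // The half we never see for pid=421761: a worker polling it back out to execute.
  synchronized Long poll() {
    String table = runQueue.poll();
    return table == null ? null : tableQueues.get(table).poll();
  }

  public static void main(String[] args) {
    ToyScheduler s = new ToyScheduler();
    s.add("t1", 421761L);
    System.out.println("worker polled pid=" + s.poll());  // the step that never happens here
  }
}
{code}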