Michael Stack created HBASE-24368:
-------------------------------------
Summary: Let HBCKSCP clear 'Unknown Servers', even if
RegionStateNode has RegionLocation == null
Key: HBASE-24368
URL: https://issues.apache.org/jira/browse/HBASE-24368
Project: HBase
Issue Type: Bug
Components: hbck2
Affects Versions: 2.3.0
Reporter: Michael Stack
This is an incidental issue I noticed while in a hole trying to fix up a
cluster. The 'obvious' remediation didn't work; this issue is about addressing
that.
HBASE-23594 added a filtering of Regions on the crashed server to handle the
case where an Assign may be concurrent to the ServerCrashProcedure. To avoid
double assign, the SCP will skip assign if the RegionStateNode RegionLocation
is not that of the crashed server.
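A minimal sketch of that filter, for illustration only. The class, the
RegionNode type and the method name below are hypothetical stand-ins, not the
actual Master code; server names are modeled as plain strings.

```java
import java.util.Objects;

// Hypothetical sketch of the HBASE-23594 style filter; not the real HBase classes.
class ScpFilterSketch {

    /** Stand-in for the Master's in-memory RegionStateNode. */
    static final class RegionNode {
        final String encodedName;
        final String regionLocation; // may be null, e.g. after a failed assign

        RegionNode(String encodedName, String regionLocation) {
            this.encodedName = encodedName;
            this.regionLocation = regionLocation;
        }
    }

    /**
     * The SCP only assigns regions whose current location is still the
     * crashed server. Any other location, including null, is skipped so a
     * concurrent assign is not doubled up.
     */
    static boolean isOnCrashedServer(RegionNode node, String crashedServer) {
        return Objects.equals(node.regionLocation, crashedServer);
    }
}
```

Note that under this check a null location never matches the crashed server.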
This is good.
Where it is an obstacle is when a Region is stuck in OPENING state, references
an 'Unknown Server' (a server no longer tracked by the Master), and has no
assign currently in flight. In this case, a ServerCrashProcedure scheduled to
clean up the reference to the Unknown Server and get the Region reassigned
gives up whenever the RegionStateNode in the Master has a RegionLocation that
does not match the server named in the ServerCrashProcedure. That includes a
RegionLocation of null (we set the RegionLocation to null when we fail an
assign, as we might if the server is no longer part of the cluster).
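One possible shape for the relaxation this issue asks for, as a sketch only.
The class, method and the 'fromHbck' flag are hypothetical and not taken from
any patch; it also assumes the region was already selected by some other means
as one that references the server being recovered.

```java
import java.util.Objects;

// Hypothetical sketch of a relaxed check for the HBCK flavor of the SCP.
class HbckScpFilterSketch {

    /**
     * @param regionLocation in-memory location of the region, null after a failed assign
     * @param crashedServer  server the procedure was scheduled for
     * @param fromHbck       true when the procedure was scheduled by the operator via hbck2
     * @return true if the procedure should go ahead and assign the region
     */
    static boolean shouldAssign(String regionLocation, String crashedServer, boolean fromHbck) {
        if (Objects.equals(regionLocation, crashedServer)) {
            return true; // the existing rule: region is still on the crashed server
        }
        // Assumed relaxation: an operator-scheduled procedure also accepts a
        // null location. A non-null location on another server is still
        // skipped to avoid a double assign.
        return fromHbck && regionLocation == null;
    }
}
```

With this shape, a regular ServerCrashProcedure keeps the protection against double assign, and only the operator-driven procedure treats null as eligible.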
For background, the cluster had a RIT (Region In Transition). The RIT was a
Region failing to open because of a missing Reference (a separate issue). The
Region open would fail with a FileNotFoundException (FNFE). The Master would
attempt the assign and then fail when it went to confirm OPEN, logging a
complaint about the FNFE and asking for operator intervention in the Master
logs.
This state was in place for weeks on this particular cluster (a dev cluster not
under close observation). The cluster had been restarted once or twice so the
server the Region had once been on was no longer 'known' but it still had an
entry in the hbase:meta table as last location assigned (The now 'Unknown
Server').
To fix, I went about the task in the wrong order. I bypassed the long-running
stuck procedure to terminate it and clean up 'Procedures and Locks'. Mistake.
Now there was no longer an assign Procedure for this Region. But I now had a
Region in OPENING state with a reference to an unknown server with an in-memory
RegionStateNode whose RegionLocation was null (set null on each failed assign).
Running catalogjanitor_run and hbck_chore_report made the unknown server show
up in the 'Unknown Servers' list of the 'HBCK Report'. Attempts at assign
failed because the Region was in OPENING state -- you can't assign a Region in
OPENING state. Scheduling an HBCKSCP via hbck2 scheduleRecoveries always
produced the following in the logs:
{code}
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=157217,
state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; HBCKServerCrashProcedure
server=unknown_server.example.com,16020,1587577972683, splitWal=true,
meta=false found a region state=OPENING, location=null, table=bobby_analytics,
region=1501ea3bd822c1a3e4e6216ea48733bd which is no longer on us
unknown_server.example.com,16020,1587577972683, give up assigning...
{code}
My workaround was to set the Region state to CLOSED with hbck2 and then do an
assign with hbck2. It was at this point that I noticed the FNFE. It would be
easier if the HBCKSCP just worked.