Michael Stack created HBASE-24368:
-------------------------------------

             Summary: Let HBCKSCP clear 'Unknown Servers', even if 
RegionStateNode has RegionLocation == null
                 Key: HBASE-24368
                 URL: https://issues.apache.org/jira/browse/HBASE-24368
             Project: HBase
          Issue Type: Bug
          Components: hbck2
    Affects Versions: 2.3.0
            Reporter: Michael Stack


This is an incidental issue noticed while down in a hole trying to fix up a 
cluster. The 'obvious' remediation didn't work. This issue is about addressing 
that.

HBASE-23594 added filtering of the Regions on the crashed server to handle the 
case where an Assign may be running concurrently with the ServerCrashProcedure. 
To avoid a double assign, the SCP skips the assign if the RegionStateNode's 
RegionLocation is not that of the crashed server.

This is good.
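
To make that check concrete, here is a minimal standalone sketch of the filter; 
the class and method names below are illustrative stand-ins, not the actual 
HBase source.

{code:java}
import java.util.Objects;

// Minimal standalone model of the HBASE-23594 skip check (illustrative only).
public class ScpAssignFilterSketch {

  /** Stand-in for the Master's in-memory RegionStateNode. */
  static final class RegionState {
    final String encodedRegionName;
    final String regionLocation; // last server the Master believed hosted the Region; may be null

    RegionState(String encodedRegionName, String regionLocation) {
      this.encodedRegionName = encodedRegionName;
      this.regionLocation = regionLocation;
    }
  }

  /**
   * The SCP only (re)assigns a Region if the Master still records it as located
   * on the crashed server; anything else may already be owned by a concurrent
   * assign, so it is skipped to avoid a double assign.
   */
  static boolean scpShouldAssign(RegionState state, String crashedServer) {
    return Objects.equals(state.regionLocation, crashedServer);
  }

  public static void main(String[] args) {
    String crashed = "unknown_server.example.com,16020,1587577972683";
    // Still recorded on the crashed server: the SCP re-assigns it.
    System.out.println(scpShouldAssign(new RegionState("abc", crashed), crashed)); // true
    // Already moved by a concurrent assign: the SCP leaves it alone (the intended case).
    System.out.println(scpShouldAssign(
        new RegionState("def", "other.example.com,16020,1"), crashed)); // false
  }
}
{code}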

Where it is an obstacle is when a Region is stuck in OPENING state, references 
an 'Unknown Server' -- a server no longer tracked by the Master -- and there is 
no assign currently in flight. In this case, scheduling a ServerCrashProcedure 
to clean up the reference to the Unknown Server and get the Region reassigned 
skips out because the RegionStateNode in the Master has a RegionLocation that 
does not match the ServerCrashProcedure's server, even when it is set to null 
(we set the RegionLocation to null when an assign fails, as it might if the 
server is no longer part of the cluster).
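
For illustration only, here is a standalone sketch of that problem case and of 
one possible shape of the fix this issue asks for; again, the names are 
stand-ins, not the real ServerCrashProcedure/HBCKServerCrashProcedure code.

{code:java}
import java.util.Objects;

public class HbckScpNullLocationSketch {

  /** Today's check: does the (HBCK)SCP consider this Region to be "on" the crashed server? */
  static boolean scpClaims(String regionLocation, String crashedServer) {
    // A failed assign clears the in-memory RegionLocation to null, so this returns
    // false, the HBCKSCP logs "... give up assigning", and the 'Unknown Server'
    // reference is never cleaned up.
    return Objects.equals(regionLocation, crashedServer);
  }

  /**
   * One possible relaxation, for HBCKSCP only: also claim a Region stuck in OPENING
   * whose in-memory RegionLocation is null, since no concurrent assign owns it.
   * This is a sketch of the intent, not the actual fix.
   */
  static boolean hbckScpClaims(String regionLocation, String crashedServer,
      boolean stuckInOpening) {
    return Objects.equals(regionLocation, crashedServer)
        || (regionLocation == null && stuckInOpening);
  }

  public static void main(String[] args) {
    String crashed = "unknown_server.example.com,16020,1587577972683";
    System.out.println(scpClaims(null, crashed));            // false: Region is skipped today
    System.out.println(hbckScpClaims(null, crashed, true));  // true: HBCKSCP could reassign it
  }
}
{code}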

For background, the cluster had a RIT. The RIT was a Region failing to open 
because of a missing Reference (another issue). The Region open would fail with 
a FileNotFoundException. The Master would attempt the assign and then fail when 
it went to confirm OPEN, logging the FNFE complaint in the Master logs and 
asking for operator intervention.

This state had been in place for weeks on this particular cluster (a dev 
cluster not under close observation). The cluster had been restarted once or 
twice, so the server the Region had last been on was no longer 'known', but 
that server still had an entry in the hbase:meta table as the Region's last 
assigned location (the now 'Unknown Server').

To fix it, I went about the task in the wrong order. I bypassed the 
long-running stuck procedure to terminate it and clean up 'Procedures and 
Locks'. Mistake. Now there was no longer an assign Procedure for this Region, 
but I had a Region in OPENING state referencing an unknown server, with an 
in-memory RegionStateNode whose RegionLocation was null (set to null on each 
failed assign). Running catalogjanitor_run and hbck_chore_report made the 
unknown server show up in the 'HBCK Report' under the 'Unknown Servers' list. 
Attempts at assign failed because the Region was in OPENING state -- you can't 
assign a Region in OPENING state. Scheduling an HBCKSCP via hbck2 
scheduleRecoveries always generated the below in the logs.

{code}
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=157217, 
state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; HBCKServerCrashProcedure 
server=unknown_server.example.com,16020,1587577972683, splitWal=true, 
meta=false found a region state=OPENING, location=null, table=bobby_analytics, 
region=1501ea3bd822c1a3e4e6216ea48733bd which is no longer on us 
unknown_server.example.com,16020,1587577972683, give up assigning...
{code}

My workaround was setting the Region state to CLOSED with hbck2 and then doing 
an assign with hbck2. At that point I noticed the FNFE. It would have been 
easier if the HBCKSCP had worked.


