Michael Stack created HBASE-23282:
-------------------------------------

             Summary: HBCKServerCrashProcedure for 'Unknown Servers'
                 Key: HBASE-23282
                 URL: https://issues.apache.org/jira/browse/HBASE-23282
             Project: HBase
          Issue Type: Bug
          Components: hbck2, proc-v2
    Affects Versions: 2.2.2
            Reporter: Michael Stack


Under a sustained, overdriving load, I can fairly easily manufacture an 
hbase:meta table that references servers that are neither in the live list 
nor members of deadservers; i.e. 'Unknown Servers'.  The new 'HBCK Report' 
UI in Master has a section that lists any 'Unknown Servers' found in 
hbase:meta.
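
As a rough illustration of the definition above: an 'Unknown Server' is one 
named in hbase:meta that is in neither the live set nor the dead set. The 
sketch below is plain Python over made-up data, not HBase code; the function 
name and the server names are hypothetical.

```python
def find_unknown_servers(meta_servers, live_servers, dead_servers):
    """Return servers referenced in meta that are neither live nor dead."""
    return set(meta_servers) - set(live_servers) - set(dead_servers)


# Hypothetical server names in host,port,startcode form.
meta_servers = ["rs1,16020,100", "rs2,16020,200", "rs3,16020,300"]
live = {"rs1,16020,100"}
dead = {"rs2,16020,200"}

# rs3 appears in meta but in neither set, so it is 'unknown'.
unknown = find_unknown_servers(meta_servers, live, dead)
```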

Once in this state, the repair is awkward. Our assign/unassign Procedure is 
particularly dogged about insisting that we confirm close/open of Regions when 
it is going about its business, which is well and good if the server is in the 
live/dead sets, but when it is an 'Unknown Server', we invariably end up trying 
to confirm against a no-longer-present server (more on this in follow-on issues).

What is wanted is queuing of a ServerCrashProcedure for each 'Unknown Server'. 
It would split any WALs (there shouldn't be any if the server was restarted) and 
ideally it would cancel out any assigns and reassign regions off the 'Unknown 
Server'.  But the 'normal' SCP consults the in-memory cluster state to figure 
out what Regions were on the crashed server, which works fine for the usual 
case. 'Unknown Servers' have no state in the Master's in-memory Maps of 
Servers to Regions or in the DeadServers list.

The suggestion here is that hbck2 be able to drive in a special SCP, an 
HBCKSCP, which would get its list of Regions by scanning hbase:meta rather 
than by asking Master memory.
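
To make the difference concrete, here is a minimal sketch of the two lookup 
strategies. It is plain Python over invented data, not the HBase 
implementation: the function names, the in-memory map, and the 
(region, server) pairs standing in for an hbase:meta scan are all 
assumptions for illustration.

```python
def regions_from_memory(server, server_to_regions):
    """What a 'normal' SCP would see: the in-memory server-to-regions map."""
    return list(server_to_regions.get(server, []))


def regions_from_meta(server, meta_rows):
    """What an HBCKSCP-style lookup would see: rows scanned from meta.

    meta_rows is an iterable of (region_name, server_name) pairs, a
    stand-in for the region-to-location rows held in hbase:meta.
    """
    return [region for region, location in meta_rows if location == server]


# Hypothetical state: the Master only knows about rs1, but meta still
# references rs3 for two regions.
in_memory = {"rs1,16020,100": ["region-a"]}
meta_rows = [
    ("region-a", "rs1,16020,100"),
    ("region-b", "rs3,16020,300"),
    ("region-c", "rs3,16020,300"),
]

unknown_server = "rs3,16020,300"
from_memory = regions_from_memory(unknown_server, in_memory)
from_meta = regions_from_meta(unknown_server, meta_rows)
```

With this data the in-memory lookup finds nothing for the unknown server, 
while the meta-based lookup finds the two regions that still point at it.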



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
