[jira] [Commented] (HBASE-23282) HBCKServerCrashProcedure for 'Unknown Servers'

Michael Stack (Jira) Sat, 15 Feb 2020 14:03:41 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-23282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037637#comment-17037637
 ]


Michael Stack commented on HBASE-23282:
---------------------------------------

[~jfrabaute] ok.

FYI, editing hbase:meta directly is problematic because the Master does not
'learn' of the edit. The Master writes all state to hbase:meta, so it does not
expect to have to read it back.

On running scheduleProcedureRecoveries, you should at least see the procedure 
registered in the master log.

Good that you tried several times. HBCKSCP will only do the extra repair if the
straight SCP finds nothing to be done.
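
A minimal sketch of that fallback order, in plain Java with no HBase dependency. The class, method, and field names here are hypothetical, and hbase:meta rows are modelled as simple objects; it only illustrates "ask Master memory first, scan hbase:meta for OPEN/OPENING mentions otherwise", not the actual procedure code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HbckScpFallbackSketch {
    /** One hbase:meta row reduced to the fields this sketch needs (hypothetical model). */
    static final class MetaRow {
        final String region;
        final String server;
        final String state;

        MetaRow(String region, String server, String state) {
            this.region = region;
            this.server = server;
            this.state = state;
        }
    }

    /**
     * Returns the Regions to recover for a server. The in-memory view (what a
     * straight SCP would consult) wins; only when it is empty do we fall back
     * to scanning the meta rows for OPEN/OPENING Regions that mention the server.
     */
    static List<String> regionsToRecover(String server,
            Map<String, List<String>> inMemoryAssignments, List<MetaRow> metaRows) {
        List<String> fromMemory = inMemoryAssignments.getOrDefault(server, List.of());
        if (!fromMemory.isEmpty()) {
            return fromMemory;
        }
        List<String> fromMeta = new ArrayList<>();
        for (MetaRow row : metaRows) {
            boolean openOrOpening = row.state.equals("OPEN") || row.state.equals("OPENING");
            if (openOrOpening && row.server.equals(server)) {
                fromMeta.add(row.region);
            }
        }
        return fromMeta;
    }

    public static void main(String[] args) {
        // Made-up server names and Regions, for illustration only.
        List<MetaRow> meta = List.of(
            new MetaRow("region-a", "rs1,16020,100", "OPENING"),
            new MetaRow("region-b", "rs2,16020,300", "OPEN"),
            new MetaRow("region-c", "rs1,16020,100", "CLOSED"));
        // Master memory knows nothing about rs1,16020,100, so the meta scan is used.
        System.out.println(regionsToRecover("rs1,16020,100", Map.of(), meta));
    }
}
```

Under this model, a server the Master still tracks in memory never reaches the meta scan, which matches the observation that the extra repair only kicks in when the straight SCP has nothing to do.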

Yeah, recover w/ the bulk load. It should show in the HBCK Report as orphan.

Need to figure out how this condition comes about. It was happening in tests
here because I was overdriving the cluster, but I had fixed most of the
problem-causing conditions.

Why it still happens, I need to figure out.

The Region is in the OPENING state but the server it is OPENING against is gone
for whatever reason (a restart of a server will clear the old instance from the
dead servers list, making it a no-longer 'known' server). The HBCKSCP is
explicitly for this case, where hbase:meta has a reference but the server does
not exist anymore. You didn't see... any logs with:

    LOG.info("Found {} mentions of {} in hbase:meta of OPEN/OPENING Regions: {}",

.... odd that CatalogJanitor could see the server but HBCKSCP could not.
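
For reference, the 'Unknown Server' condition can be sketched as a set difference: servers referenced in hbase:meta that are in neither the live nor the dead set. This is an illustrative model only; the class and method names are hypothetical and the server names are made up.

```java
import java.util.Set;
import java.util.TreeSet;

public class UnknownServersSketch {
    /**
     * Servers mentioned in hbase:meta that the Master tracks as neither live
     * nor dead. All three inputs are plain sets of server-name strings here.
     */
    static Set<String> unknownServers(Set<String> referencedInMeta,
            Set<String> live, Set<String> dead) {
        Set<String> unknown = new TreeSet<>(referencedInMeta);
        unknown.removeAll(live);
        unknown.removeAll(dead);
        return unknown;
    }

    public static void main(String[] args) {
        // A restarted server: hbase:meta still mentions the old instance
        // (start code 100) while only the new instance (start code 200) is
        // live and the old one has been cleared from the dead list.
        Set<String> inMeta = Set.of("rs1,16020,100", "rs2,16020,300");
        Set<String> live = Set.of("rs1,16020,200", "rs2,16020,300");
        Set<String> dead = Set.of();
        System.out.println(unknownServers(inMeta, live, dead));
    }
}
```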

Did you regenerate the HBCK Report after running HBCKSCP a few times?
Otherwise, HBCKSCP fixes the issue but the report stays stale until
CatalogJanitor is re-run.

Thanks.

> HBCKServerCrashProcedure for 'Unknown Servers'
> ----------------------------------------------
>
>                 Key: HBASE-23282
>                 URL: https://issues.apache.org/jira/browse/HBASE-23282
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck2, proc-v2
>    Affects Versions: 2.2.2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.2.3
>
>
> With an overdriving, sustained load, I can fairly easily manufacture an 
> hbase:meta table that references servers that are no longer in the live list 
> nor are members of deadservers; i.e. 'Unknown Servers'.  The new 'HBCK 
> Report' UI in Master has a section where it lists 'Unknown Servers' if any in 
> hbase:meta.
> Once in this state, the repair is awkward. Our assign/unassign Procedure is 
> particularly dogged about insisting that we confirm close/open of Regions 
> when it is going about its business which is well and good if server is in 
> live/dead sets, but with an 'Unknown Server' we invariably end up trying to
> confirm against a no-longer-present server (more on this in follow-on
> issues).
> What is wanted is queuing of a ServerCrashProcedure for each 'Unknown 
> Server'. It would split any WALs (there shouldn't be any if server was 
> restarted) and ideally it would cancel out any assigns and reassign regions 
> off the 'Unknown Server'. But the 'normal' SCP consults the in-memory
> cluster state to figure what Regions were on the crashed server, which works
> fine for the usual case. 'Unknown Servers', though, have no state in the
> Master's in-memory Maps of Servers to Regions or in the DeadServers list.
> Suggestion here is that hbck2 be able to drive in a special SCP, one which 
> would get list of Regions by scanning hbase:meta rather than asking Master 
> memory; an HBCKSCP.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
