[ 
https://issues.apache.org/jira/browse/HBASE-23369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988515#comment-16988515
 ] 

Michael Stack commented on HBASE-23369:
---------------------------------------

Patch to force-close the unexpected/unknown. If the Region is unknown or 
double-assigned, the force-close needs no further follow-up. If it is something 
like a Read Replica or a Region that should be online, the operator may then 
need to online it or teach the Master about the 'unknown'.
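
For illustration only, a rough sketch of the decision described above. This is 
not the code in the patch: the class, method, and enum names are all made up, 
and the Master's view is reduced to a plain map from encoded Region name to the 
server the Master expects it on.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from HBase or from the patch.
final class ReportedRegionCheck {
  enum Action { NONE, FORCE_CLOSE }

  // Assumed stand-in for the Master's view:
  // encoded Region name -> server the Master expects the Region to be open on.
  private final Map<String, String> expectedServerByRegion = new HashMap<>();

  void expect(String encodedRegionName, String serverName) {
    expectedServerByRegion.put(encodedRegionName, serverName);
  }

  // Called for each Region a RegionServer reports as OPEN.
  Action onReportedOpen(String encodedRegionName, String reportingServer) {
    String expected = expectedServerByRegion.get(encodedRegionName);
    if (expected == null) {
      // Master has no record of this Region: the 'unknown' case.
      return Action.FORCE_CLOSE;
    }
    if (!expected.equals(reportingServer)) {
      // Master expects the Region elsewhere: the double-assigned case.
      return Action.FORCE_CLOSE;
    }
    // Report matches the Master's view; nothing to do.
    return Action.NONE;
  }
}
```

The sketch stops at the close decision. Whether a closed Region then needs to be 
onlined again (Read Replica, or a Region that should be online) is left to the 
operator, as described above.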

Have been running this patch on a large test cluster. It helped tamp down the 
number of errors to fix. The patch also includes tuning of the recently added 
HBCKSCP. It is too greedy in the list of Regions it goes to reassign, 
especially if the Master is damaged, so the patch adds filters/checks. Before 
this patch, HBCKSCP was good at manufacturing double-assigns, which was useful 
for testing the auto-close portion of this patch.

No hurry on commit. Still testing but might be of interest....

> Auto-close 'unknown' Regions reported as OPEN on RegionServers
> --------------------------------------------------------------
>
>                 Key: HBASE-23369
>                 URL: https://issues.apache.org/jira/browse/HBASE-23369
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Michael Stack
>            Priority: Major
>
> In the old days, if a RegionServer reported state that didn't agree w/ the 
> Master's view of the cluster, we'd kill the RegionServer.
> Lately, in tests that overrun a cluster, after sustained high load, the Master 
> can start failing its updates against Meta (CallQueueTooBigException <= More 
> on this later). It can then lose proper accounting of all Region members. In 
> one variant, a RegionServer reports its list of open Regions to the Master 
> and the Master doesn't 'know' of a particular Region, or the Master may know 
> the Region but expects it open on another RegionServer.
> Here is an example of how it looks each time the RS reports:
> {code}
>  2019-12-03 07:07:00,757 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: No 
> t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. RegionStateNode 
> but reported ONLINE at server.example.org,16020,1575354666245 
> (inServerRegionList=false).
>  2019-12-03 07:07:03,793 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: No 
> t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. RegionStateNode 
> but reported ONLINE at server.example.org,16020,1575354666245 
> (inServerRegionList=false).
> {code}
> Will also show as an 'inconsistency' in the 'HBCK' tab on the Master UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
