[jira] [Commented] (HBASE-23369) Auto-close 'unknown' Regions reported as OPEN on RegionServers

Michael Stack (Jira) Thu, 05 Dec 2019 09:18:47 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-23369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988993#comment-16988993
 ]


Michael Stack commented on HBASE-23369:
---------------------------------------

On HBASE-21421 (and your 1-3 above):

The accounting mismatches this patch addresses are of a different type: here, 
the Master has NO record of the Region the RegionServer is reporting on. The 
problem addressed is the case where we fail the first check in 
checkOnlineRegionsReport, not the later comparisons of Master state against 
what the RegionServer is reporting (HBASE-21421 catching 
UnexpectedStateException).
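
To make the distinction concrete, here is a minimal Java sketch of the two 
kinds of mismatch. It is not the AssignmentManager code: the class 
RegionReportSketch, its Verdict enum, and its methods are hypothetical names 
invented for illustration, and the Master's view is reduced to a plain map 
from Region name to the server expected to host it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the HBase AssignmentManager. */
class RegionReportSketch {
    enum Verdict { CONSISTENT, UNKNOWN_TO_MASTER, STATE_MISMATCH }

    // Simplified stand-in for the Master's records: region name -> expected server.
    private final Map<String, String> masterView = new HashMap<>();

    void recordAssignment(String region, String server) {
        masterView.put(region, server);
    }

    /** Classify one Region reported as online by a RegionServer. */
    Verdict check(String reportingServer, String region) {
        String expected = masterView.get(region);
        if (expected == null) {
            // First check fails: the Master has no record at all (this issue).
            return Verdict.UNKNOWN_TO_MASTER;
        }
        if (!expected.equals(reportingServer)) {
            // Later comparison fails: the Master knows the Region, but elsewhere.
            return Verdict.STATE_MISMATCH;
        }
        return Verdict.CONSISTENT;
    }

    /** Regions from one report that are candidates for an automatic close. */
    List<String> regionsToClose(String reportingServer, List<String> reported) {
        List<String> toClose = new ArrayList<>();
        for (String region : reported) {
            if (check(reportingServer, region) == Verdict.UNKNOWN_TO_MASTER) {
                toClose.add(region);
            }
        }
        return toClose;
    }

    public static void main(String[] args) {
        RegionReportSketch sketch = new RegionReportSketch();
        sketch.recordAssignment("r1", "rsA");
        List<String> report = new ArrayList<>();
        report.add("r1");
        report.add("r2");
        System.out.println(sketch.regionsToClose("rsB", report));
    }
}
```

In this sketch only the UNKNOWN_TO_MASTER case is selected for closing; the 
STATE_MISMATCH case is left to whatever handling already exists for it.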

From HBASE-21421 description:

bq. I have encountered cases that the regions are closed on the RS and 
reportRegionStateTransition to master successfully. But later, a lagging 
regionServerReport tells the master the region is online on the RS(Which is not 
at the moment, this call may generated some time ago and delayed by network 
somehow), the the master think the region should be on another RS, and kill the 
RS, which should not be.

In our case, there is no 'killing' anymore since HBASE-21421. If the above 
scenario happened, we'd send a close to the RS, but either there'd be no Region 
for it to close (it had already moved) or, if the Region were in the act of 
closing, the close would be ignored/ineffectual.

On relying on the operator to run hbck2 to 'fix' this condition: for this 
failure type, hbck2 can't even 'ask' the Master to close the misreporting 
Region because the Master doesn't know anything about the Region. In the 
double-assign case, the Master knows of the Region on server A but not of the 
instance on server B. We could add a direct close call to hbck2, where we go to 
the RS and ask it to do a silent close as we used to have in hbck1 (we should 
add this anyway), but at least in my testing on a decent-sized cluster, sorting 
out the list of operations to run can be a large undertaking; meantime the HBCK 
report and logs are overwhelmed w/ complaints.





> Auto-close 'unknown' Regions reported as OPEN on RegionServers
> --------------------------------------------------------------
>
>                 Key: HBASE-23369
>                 URL: https://issues.apache.org/jira/browse/HBASE-23369
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Michael Stack
>            Priority: Major
>
> In the old days, if a RegionServer reported a state that didn't agree w/ the 
> Master's view of the cluster, we'd kill the RegionServer.
> Lately, in tests that overrun a cluster, after sustained high load, the Master 
> can start failing its updates against Meta (CallQueueTooBigException <= More 
> on this later). It can then lose proper accounting of its Regions. One 
> variant has a RegionServer reporting its list of open Regions to the Master 
> where the Master either doesn't 'know' of a particular Region or knows the 
> Region but expects it open on another RegionServer.
> Here is an example of how it looks each time the RS reports:
> {code}
>  2019-12-03 07:07:00,757 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: No 
> t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. RegionStateNode 
> but reported ONLINE at server.example.org,16020,1575354666245 
> (inServerRegionList=false).
>  2019-12-03 07:07:03,793 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: No 
> t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. RegionStateNode 
> but reported ONLINE at server.example.org,16020,1575354666245 
> (inServerRegionList=false).
> {code}
> Will also show as an 'inconsistency' in the 'HBCK' tab on the Master UI.
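
The WARN lines quoted above follow a regular shape, so one way to get a 
per-server list of unknown Regions is to parse them. A rough Java sketch, 
assuming each log record has been rejoined onto a single line (the mail wraps 
them) and that Region names contain no whitespace. The class and method names 
are hypothetical, and the pattern is inferred only from the two sample lines, 
not from the HBase source.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the pattern is an assumption based on the sample log lines. */
class UnknownRegionLogSketch {
    private static final Pattern WARN = Pattern.compile(
        "No (\\S+) RegionStateNode but reported ONLINE at (\\S+) "
            + "\\(inServerRegionList=(?:true|false)\\)");

    /** Group the Regions named in matching WARN lines by the reporting server. */
    static Map<String, Set<String>> unknownRegionsByServer(List<String> lines) {
        Map<String, Set<String>> byServer = new TreeMap<>();
        for (String line : lines) {
            Matcher m = WARN.matcher(line);
            if (m.find()) {
                // group(1) is the Region name, group(2) the reporting server.
                byServer.computeIfAbsent(m.group(2), k -> new LinkedHashSet<>())
                        .add(m.group(1));
            }
        }
        return byServer;
    }

    public static void main(String[] args) {
        String line = "2019-12-03 07:07:00,757 WARN "
            + "org.apache.hadoop.hbase.master.assignment.AssignmentManager: No "
            + "t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. "
            + "RegionStateNode but reported ONLINE at "
            + "server.example.org,16020,1575354666245 (inServerRegionList=false).";
        System.out.println(unknownRegionsByServer(java.util.Arrays.asList(line, line)));
    }
}
```

Because the same warning repeats on every report, the sketch collects Regions 
into a set per server so repeats collapse to one entry.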



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
