[
https://issues.apache.org/jira/browse/HBASE-23369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988993#comment-16988993
]
Michael Stack commented on HBASE-23369:
---------------------------------------
On HBASE-21421 (and your 1-3 above):
The accounting mismatches this patch addresses are of a different type: here,
the Master has NO record of the Region the RegionServer is reporting on. The
problem addressed is when we fail the first check in checkOnlineRegionsReport,
not the later compares of Master state vs what the RegionServer is reporting
(HBASE-21421 catching UnexpectedStateException).
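To make the distinction between the two checks concrete, here is a rough sketch.
It is NOT the actual AssignmentManager code: the class, enum, and method names
below (OnlineReportSketch, Verdict, RegionNode, check, regionsToClose) are all
made up for illustration, and the Master's view is reduced to a plain map from
Region name to the server the Master expects it on.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OnlineReportSketch {
    /** Possible outcomes for one reported Region (hypothetical names). */
    enum Verdict { OK, UNKNOWN_REGION_CLOSE, STATE_MISMATCH_WARN }

    /** Hypothetical stand-in for the Master's per-Region record. */
    static final class RegionNode {
        final String expectedServer;
        RegionNode(String expectedServer) { this.expectedServer = expectedServer; }
    }

    /** Classify a single Region that a server reports as online. */
    static Verdict check(Map<String, RegionNode> masterView, String server, String region) {
        RegionNode node = masterView.get(region);
        if (node == null) {
            // First check fails: the Master has no record of this Region at all.
            // This is the case discussed here; the proposal is to ask the
            // reporting server to close it.
            return Verdict.UNKNOWN_REGION_CLOSE;
        }
        if (!server.equals(node.expectedServer)) {
            // Later compare: the Master knows the Region but expects it elsewhere.
            // This is the HBASE-21421 territory (warn rather than kill the server).
            return Verdict.STATE_MISMATCH_WARN;
        }
        return Verdict.OK;
    }

    /** Collect the reported Regions the Master knows nothing about. */
    static List<String> regionsToClose(Map<String, RegionNode> masterView,
                                       String server, List<String> reported) {
        List<String> toClose = new ArrayList<>();
        for (String region : reported) {
            if (check(masterView, server, region) == Verdict.UNKNOWN_REGION_CLOSE) {
                toClose.add(region);
            }
        }
        return toClose;
    }

    public static void main(String[] args) {
        Map<String, RegionNode> masterView = new HashMap<>();
        masterView.put("regionA", new RegionNode("rs1"));
        masterView.put("regionB", new RegionNode("rs2"));
        // rs1 reports one Region the Master agrees on, one it expects elsewhere,
        // and one it has never heard of.
        List<String> reported = List.of("regionA", "regionB", "regionC");
        System.out.println(regionsToClose(masterView, "rs1", reported));
    }
}
```

Only the null-lookup branch is in scope for this issue; the mismatch branch is
where the HBASE-21421 handling applies.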
From HBASE-21421 description:
bq. I have encountered cases that the regions are closed on the RS and
reportRegionStateTransition to master successfully. But later, a lagging
regionServerReport tells the master the region is online on the RS (which is not
at the moment, this call may have been generated some time ago and delayed by
network somehow), the master thinks the region should be on another RS, and
kills the RS, which should not be.
In our case, there is no 'killing' anymore since HBASE-21421 -- and if the
above scenario happened, we'd send a close to the RS but there'd be no Region
for it to close (it had already moved) or if in the act of closing, the close
would be ignored/ineffectual.
On relying on the operator to run hbck2 to 'fix' this condition: for this
failure-type, hbck2 can't even 'ask' the Master to close the misreporting
Region because the Master doesn't know anything about the Region. In a
double-assign, the Master knows of the Region on server A but not about the
instance on server B. We could add a direct close call to hbck2 where we go to
the RS and ask it to do a silent close as we used to have in hbck1 -- we should
add this anyway -- but at least in my testing on a decent cluster size, sorting
out the list of what operations to run can be a large undertaking; meantime the
HBCK report and logs are overwhelmed w/ complaints.
> Auto-close 'unknown' Regions reported as OPEN on RegionServers
> --------------------------------------------------------------
>
> Key: HBASE-23369
> URL: https://issues.apache.org/jira/browse/HBASE-23369
> Project: HBase
> Issue Type: Bug
> Reporter: Michael Stack
> Priority: Major
>
> In the old days, if a RegionServer reported a variance that didn't agree w/
> the Master's view of the cluster, we'd kill the RegionServer.
> Lately, in tests that overrun a cluster, after sustained high load, the Master
> can start failing its updates against Meta (CallQueueTooBigException <= More
> on this later). It then can lose proper accounting of all Region members. One
> variant has a RegionServer reporting its list of open Regions to the Master
> and the Master doesn't 'know' of a particular Region, or the Master may know
> the Region but expects it open on another RegionServer.
> Here is an example of how it looks each time RS reports:
> {code}
> 2019-12-03 07:07:00,757 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: No
> t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. RegionStateNode
> but reported ONLINE at server.example.org,16020,1575354666245
> (inServerRegionList=false).
> 2019-12-03 07:07:03,793 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: No
> t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. RegionStateNode
> but reported ONLINE at server.example.org,16020,1575354666245
> (inServerRegionList=false).
> {code}
> Will also show as an 'inconsistency' in the 'HBCK' tab on the Master UI.