[ 
https://issues.apache.org/jira/browse/HBASE-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504491#comment-17504491
 ] 

Bryan Beaudreault commented on HBASE-26783:
-------------------------------------------

I looked into the above nightly build failure. This is a legit failure for 
branch-1, though branch-2 is unaffected.

In branch-2, the meta location is cached into the MetaCache. So calls from 
RegionServerCallable.throwable(), which calls 
ConnectionImplementation.updateCachedLocations, will appropriately clear cache 
when errors are encountered while scanning meta. Thus, on retry, 
ScannerCallable.prepare() will pull a new location for meta. This is the 
behavior I expected when developing the branch-2 PR.

In branch-1, the meta location is *not* cached in MetaCache, since 
https://issues.apache.org/jira/browse/HBASE-21464. So 
ConnectionImplementation.updateCachedLocations is a no-op when an error is 
encountered while scanning meta. This is because that method checks the current 
MetaCache for any cached locations and exits early if there are none. Since 
meta is not in MetaCache, there are no cached locations. Thus, on retry, 
ScannerCallable.prepare() will continue to see the old bad location for meta.
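
To illustrate the early exit described above, here is a minimal, self-contained sketch. It is a toy model and not the actual branch-1 code: the class MetaCacheSketch, its fields, and the locate() helper are hypothetical stand-ins, and it assumes the meta location is held separately from the per-table cache.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every name here is a hypothetical stand-in, not HBase client code.
class MetaCacheSketch {
    static final String META = "hbase:meta";

    // Stand-in for MetaCache: in this model it only ever holds user tables.
    final Map<String, String> metaCache = new HashMap<>();

    // Assumption: the meta location is tracked outside the per-table cache.
    String metaLocation = "rs-old";

    // Mirrors the early exit: if nothing is cached for the table, do nothing.
    void updateCachedLocations(String tableName) {
        if (!metaCache.containsKey(tableName)) {
            return; // always taken for meta, so the stale location survives
        }
        metaCache.remove(tableName);
    }

    // What a retry would resolve the table's location to in this model.
    String locate(String tableName) {
        if (tableName.equals(META)) {
            return metaLocation;
        }
        return metaCache.get(tableName);
    }

    public static void main(String[] args) {
        MetaCacheSketch conn = new MetaCacheSketch();
        conn.metaCache.put("t1", "rs-a");

        conn.updateCachedLocations("t1");  // user table: cached entry is cleared
        conn.updateCachedLocations(META);  // meta: no cached entry, so a no-op

        System.out.println("t1 -> " + conn.locate("t1"));
        System.out.println("meta -> " + conn.locate(META));
    }
}
```

In this model the user table's entry is cleared and would be looked up again on retry, while the meta lookup still returns the old location, which matches the retry behavior described above.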

The fix for branch-1 could be simply this change in 
ConnectionManager.updateCachedLocations:
{code:java}
if (tableName.equals(TableName.META_TABLE_NAME)) {
  clearMetaRegionLocation();
  return;
}
{code}
I've verified that this change fixes the failing test from the nightly build, 
TestReplicationSyncUpTool, but I don't have time to think through all the 
other possible ramifications or write a test to cover this case. I might be 
inclined to revert the branch-1 patch, since it was only backported 
opportunistically and branch-1 is EOL. Thoughts [~apurtell]?

> ScannerCallable doubly clears meta cache on retries
> ---------------------------------------------------
>
>                 Key: HBASE-26783
>                 URL: https://issues.apache.org/jira/browse/HBASE-26783
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.10
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>             Fix For: 2.5.0, 1.8.0, 1.7.2, 2.6.0, 2.4.11
>
>
> Way back in HBASE-15658 [~ghelmling] fixed RegionServerCallable to not clear 
> meta in {{prepare(boolean reload)}}, because it already would have 
> cleared it in the try/catch when {{throwable(Throwable t, boolean 
> retrying)}} is called.
> I have recently been doing some load tests where I am causing HBase 
> RegionServers to throw many CallDroppedExceptions because they are overloaded 
> by the test. While this is an extreme example, it does sometimes crop up in 
> production when a bad actor executes a job without rate limiting, etc. What I 
> noticed was that the RegionServer hosting meta was most affected by the load, 
> way more than any other server in the cluster. Digging into the issue I 
> realized that the extra meta load was coming mostly from the scans, 
> originating from {{ScannerCallable.prepare(boolean reload)}}.
> I'm not sure why ScannerCallable was excluded from the original jira, maybe 
> an oversight. But ScannerCallable is called in the same context of 
> RetryingRpcCaller, which will handle clearing meta in the try/catch like 
> other callables. We should similarly update ScannerCallable's prepare method 
> not always pass useCache=true when getting region locations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
