Charles Connell created HBASE-29502:
---------------------------------------

             Summary: RegionReplicaReplicationEndpoint fails to forward mutations when meta cache does not contain secondary replica locations
                 Key: HBASE-29502
                 URL: https://issues.apache.org/jira/browse/HBASE-29502
             Project: HBase
          Issue Type: Bug
          Components: read replicas
            Reporter: Charles Connell
            Assignee: Charles Connell


When region replicas are enabled in "asynchronous WAL replication" mode, each 
RegionServer uses a {{RegionReplicaReplicationEndpoint}} object to tail its own 
WAL. Each mutation in the WAL may belong to a region that has its primary 
replica on this RegionServer and one or more secondary replicas on other 
servers. So, for each mutation in the WAL, {{RegionReplicaReplicationEndpoint}} 
checks whether any other servers are hosting replicas of the relevant region 
and, if so, sends those servers an RPC containing the mutations they should 
apply to their memstores.

When region replicas are enabled, a {{RegionReplicaReplicationEndpoint}} 
instance is created with its own {{ConnectionImplementation}}, and therefore 
its own {{MetaCache}}. The endpoint immediately starts trying to send 
mutations to secondary replica regions, even though those regions will not be 
open for a few more seconds or minutes. During that window, the {{MetaCache}} 
is populated with entries saying that most regions are hosted on only one 
server. These cached lookups remain in use indefinitely, even though they are 
incorrect for most of their lifetime. Without knowing where the secondary 
replica regions are hosted, or whether they exist at all, the 
{{RegionReplicaReplicationEndpoint}} cannot forward mutations to them.
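
The stale-cache behavior described above can be illustrated with a small, self-contained model. This is only a sketch: {{StaleLocationCacheSketch}}, its {{meta}} and {{cache}} maps, and the {{locate}} method are hypothetical stand-ins, not HBase classes, and the real {{MetaCache}} is considerably more involved. It shows only that a cache filled before the secondaries open keeps answering with the primary alone until a lookup bypasses it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaleLocationCacheSketch {
    // Hypothetical stand-in for the authoritative region locations stored in meta.
    static final Map<String, List<String>> meta = new HashMap<>();
    // Hypothetical stand-in for the endpoint's private location cache.
    static final Map<String, List<String>> cache = new HashMap<>();

    // Returns the servers hosting replicas of the region. A cached answer is
    // returned when allowed; otherwise meta is read and the cache is refreshed.
    static List<String> locate(String region, boolean useCache) {
        if (useCache && cache.containsKey(region)) {
            return cache.get(region);
        }
        List<String> fresh = meta.get(region);
        cache.put(region, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        // Only the primary is open when the first lookup happens.
        meta.put("region-a", List.of("rs1"));
        System.out.println(locate("region-a", true));  // [rs1]

        // The secondaries come online later, but the cached entry is reused.
        meta.put("region-a", List.of("rs1", "rs2", "rs3"));
        System.out.println(locate("region-a", true));  // [rs1]

        // Only a lookup that skips the cache learns about the secondaries.
        System.out.println(locate("region-a", false)); // [rs1, rs2, rs3]
    }
}
```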

{{RegionReplicaReplicationEndpoint}} actually contains cache-busting logic 
seemingly designed to fix this exact problem:
{code:java}
// Replicas can take a while to come online. The cache may have only the primary. If we
// keep going to the cache, we will not learn of the replicas and their locations after
// they come online.
if (useCache && locations.size() == 1 && TableName.isMetaTableName(tableName)) {
  if (tableDescriptors.get(tableName).getRegionReplication() > 1) {
    // Make an obnoxious log here. See how bad this issue is. Add a timer if happening
    // too much.
    LOG.info("Skipping location cache; only one location found for {}", tableName);
    useCache = false;
    continue;
  }
}
{code}

However, because of the {{TableName.isMetaTableName(tableName)}} clause, the 
cache-busting only takes effect if the mutation being forwarded belongs to the 
meta table. I don't know why that restriction would make sense.

In this ticket I plan to just remove the "is meta table" clause to fix this bug.
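
To make the effect of the proposed change concrete, here is a minimal sketch that reduces the quoted condition to plain booleans. The class and method names are hypothetical, {{isMetaTable}} stands in for {{TableName.isMetaTableName(tableName)}}, and {{regionReplication}} stands in for {{tableDescriptors.get(tableName).getRegionReplication()}}. It is not the actual patch, only a comparison of the predicate with and without the "is meta table" clause.

```java
public class CacheBypassConditionSketch {
    // Mirrors the condition quoted in this ticket: the cache is bypassed only
    // when the mutation belongs to the meta table.
    static boolean bypassWithMetaClause(boolean useCache, int cachedLocations,
                                        boolean isMetaTable, int regionReplication) {
        return useCache && cachedLocations == 1 && isMetaTable && regionReplication > 1;
    }

    // Assumed shape of the condition once the "is meta table" clause is removed.
    static boolean bypassWithoutMetaClause(boolean useCache, int cachedLocations,
                                           int regionReplication) {
        return useCache && cachedLocations == 1 && regionReplication > 1;
    }

    public static void main(String[] args) {
        // A user table with 3 replicas whose cache entry holds only the primary.
        System.out.println(bypassWithMetaClause(true, 1, false, 3)); // false
        System.out.println(bypassWithoutMetaClause(true, 1, 3));     // true

        // A table without replicas never needs the bypass in either form.
        System.out.println(bypassWithoutMetaClause(true, 1, 1));     // false
    }
}
```

With the clause present, a user table never triggers the bypass, so its stale single-location entry is reused on every lookup. Without the clause, any table configured with more than one replica gets a fresh lookup whenever the cache returns only one location.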



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
