Charles Connell created HBASE-29502:
---------------------------------------
Summary: RegionReplicaReplicationEndpoint fails to forward mutations when meta cache does not contain secondary replica locations
Key: HBASE-29502
URL: https://issues.apache.org/jira/browse/HBASE-29502
Project: HBase
Issue Type: Bug
Components: read replicas
Reporter: Charles Connell
Assignee: Charles Connell

When region replicas are enabled in "asynchronous WAL replication" mode, each RegionServer uses a {{RegionReplicaReplicationEndpoint}} object to tail its own WAL. Each mutation in the WAL may belong to a region whose primary replica is on this RegionServer and which has one or more secondary replicas on other servers. So, for each mutation in the WAL, {{RegionReplicaReplicationEndpoint}} decides whether any other servers are hosting replicas of the relevant region, and if so, sends an RPC to those servers containing the mutations they should apply to their memstores.

When region replicas are enabled, a {{RegionReplicaReplicationEndpoint}} instance is created, with its own {{ConnectionImplementation}} and therefore its own {{MetaCache}}. This {{RegionReplicaReplicationEndpoint}} immediately starts attempting to send mutations to secondary replica regions, even though they will not be open for a few more seconds or minutes. During this window, the {{MetaCache}} gets populated with entries that say that most regions are hosted on only one server. These cached lookups remain in use indefinitely, even though they are incorrect for most of their lifetime. Without knowing where the secondary replica regions are hosted, or whether they exist at all, the {{RegionReplicaReplicationEndpoint}} cannot forward mutations to them.

{{RegionReplicaReplicationEndpoint}} actually contains cache-busting logic seemingly designed to fix this exact problem:

{code:java}
// Replicas can take a while to come online. The cache may have only the primary. If we
// keep going to the cache, we will not learn of the replicas and their locations after
// they come online.
if (useCache && locations.size() == 1 && TableName.isMetaTableName(tableName)) {
  if (tableDescriptors.get(tableName).getRegionReplication() > 1) {
    // Make an obnoxious log here. See how bad this issue is. Add a timer if happening
    // too much.
    LOG.info("Skipping location cache; only one location found for {}", tableName);
    useCache = false;
    continue;
  }
}
{code}

However, because of the {{TableName.isMetaTableName(tableName)}} clause, the cache-busting only takes effect if the mutation being forwarded belongs to the meta table. I don't know why that restriction would make sense. In this ticket I plan to just remove the "is meta table" clause to fix this bug.
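As a rough illustration of why the extra clause matters, here is a minimal, self-contained Java sketch of the lookup-and-retry pattern described above. It does not use any HBase classes: {{ReplicaLookupSketch}}, its two maps, the {{locate}} and {{resolve}} methods, and the {{metaOnlyGuard}} flag are all hypothetical stand-ins invented for this example, and the real {{MetaCache}} behaves in more complicated ways. The sketch only assumes that a cached entry is reused until something explicitly bypasses the cache.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch; none of these types or methods are HBase APIs.
class ReplicaLookupSketch {
  static final String META_TABLE = "hbase:meta";

  // Hypothetical stand-in for the authoritative region locations.
  private final Map<String, List<String>> authoritative = new HashMap<>();
  // Hypothetical stand-in for the client-side location cache.
  private final Map<String, List<String>> cache = new HashMap<>();

  void setAuthoritative(String table, String... servers) {
    authoritative.put(table, Arrays.asList(servers));
  }

  // Returns the cached entry when allowed; otherwise reloads and re-caches it.
  List<String> locate(String table, boolean useCache) {
    if (useCache && cache.containsKey(table)) {
      return cache.get(table);
    }
    List<String> fresh = authoritative.get(table);
    cache.put(table, fresh);
    return fresh;
  }

  // metaOnlyGuard = true mimics the current clause; false mimics the proposed fix.
  List<String> resolve(String table, int regionReplication, boolean metaOnlyGuard) {
    boolean useCache = true;
    while (true) {
      List<String> locations = locate(table, useCache);
      boolean guardPasses = !metaOnlyGuard || META_TABLE.equals(table);
      if (useCache && locations.size() == 1 && guardPasses && regionReplication > 1) {
        // Only one location cached for a replicated table: retry once, bypassing the cache.
        useCache = false;
        continue;
      }
      return locations;
    }
  }

  public static void main(String[] args) {
    ReplicaLookupSketch sketch = new ReplicaLookupSketch();
    // At startup only the primary is open, so only the primary gets cached.
    sketch.setAuthoritative("usertable", "rs1");
    sketch.resolve("usertable", 3, true);
    // Later the secondary replicas come online.
    sketch.setAuthoritative("usertable", "rs1", "rs2", "rs3");
    System.out.println("with meta-only clause:    " + sketch.resolve("usertable", 3, true));
    System.out.println("without meta-only clause: " + sketch.resolve("usertable", 3, false));
  }
}
```

In this toy model, the lookup with the meta-only clause keeps returning the stale single-server entry for a user table, while the lookup without the clause bypasses the cache once and picks up all three servers. The retry cannot loop forever because {{useCache}} is false on the second pass.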