[
https://issues.apache.org/jira/browse/HBASE-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329503#comment-16329503
]
Scott Wilson commented on HBASE-19816:
--------------------------------------
The patch addresses a symptom; the root cause appears to be a ZooKeeper watch
that was never triggered. Are there known issues with watches?
> Replication sink list is not updated on UnknownHostException
> ------------------------------------------------------------
>
> Key: HBASE-19816
> URL: https://issues.apache.org/jira/browse/HBASE-19816
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.0.0, 1.2.0
> Environment: We have two clusters set up with bi-directional
> replication. The clusters are around 400 nodes each and hosted in AWS.
> Reporter: Scott Wilson
> Priority: Major
> Attachments: HBASE-19816.master.001.patch
>
>
> We have two clusters, call them 1 and 2. Cluster 1 was the current "primary"
> cluster and was taking all live traffic, which is replicated to cluster 2. We
> decommissioned several instances in cluster 2, which involves deleting the
> instance and its DNS record. After this happened, most of the region servers
> in cluster 1 showed this message in their logs repeatedly.
>
> {code}
> 2018-01-12 23:49:36,507 WARN org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of a local or network error:
> java.net.UnknownHostException: data-017b.hbase-2.prod
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.<init>(AbstractRpcClient.java:315)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.createBlockingRpcChannel(AbstractRpcClient.java:267)
>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getAdmin(ConnectionManager.java:1737)
>   at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getAdmin(ConnectionManager.java:1719)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager.getReplicationSink(ReplicationSinkManager.java:119)
>   at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:339)
>   at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:326)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> The host data-017b.hbase-2.prod was one of those that had been removed from
> cluster 2. We then observed that replication lag from cluster 1 to cluster 2
> was elevated; some region servers reported ageOfLastShippedOperation close to
> an hour.
>
> The only way we found to clear the message was to restart the region servers
> that showed it in their logs. Once we did, replication returned to normal.
> Restarting the affected region servers in cluster 1 took several days because
> we could not bring the cluster down.
> From reading the code, it appears the cause was the ZooKeeper watch not being
> triggered for the region server list change in cluster 2. We verified that the
> list in ZooKeeper for cluster 2 was correct and did not include the removed
> nodes (a sketch of that check follows).
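> For reference, a minimal sketch of that verification using the plain ZooKeeper
> client API. The quorum address is a placeholder and /hbase/rs is the default
> parent znode for live region servers; both are assumptions about the deployment:
> {code:java}
> import java.util.List;
> import org.apache.zookeeper.ZooKeeper;
>
> public class ListSinkRegionServers {
>   public static void main(String[] args) throws Exception {
>     // Connect to the sink cluster's ZooKeeper quorum (placeholder address).
>     ZooKeeper zk = new ZooKeeper("zk.hbase-2.prod:2181", 30000, event -> { });
>     // Each child of /hbase/rs is a live region server, named host,port,startcode.
>     List<String> regionServers = zk.getChildren("/hbase/rs", false);
>     regionServers.forEach(System.out::println);
>     zk.close();
>   }
> }
> {code}
> None of the decommissioned hosts appeared in that list, which suggests the stale
> entries lived only in the source cluster's cached sink list.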
> One concrete improvement would be to force a refresh of the sink cluster's
> region server list when an {{UnknownHostException}} is encountered. This is
> already done when there is a {{ConnectException}} in
> {{HBaseInterClusterReplicationEndpoint.java}}:
> {code:java}
> } else if (ioe instanceof ConnectException) {
>   LOG.warn("Peer is unavailable, rechecking all sinks: ", ioe);
>   replicationSinkMgr.chooseSinks();
> }
> {code}
> I propose extending this handling to cover {{UnknownHostException}} as well;
> a sketch of the change follows.
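> A minimal, self-contained sketch of the proposed classification. The class and
> helper names here are hypothetical and only illustrate the check; in an actual
> patch the change would sit in the {{IOException}} handling shown above:
> {code:java}
> import java.io.IOException;
> import java.net.ConnectException;
> import java.net.UnknownHostException;
>
> public class SinkErrorClassification {
>   // Hypothetical helper: true when the failure should trigger re-choosing sinks.
>   // Today only ConnectException does; the proposal adds UnknownHostException,
>   // since a deleted DNS record makes a sink just as unreachable as a refused
>   // connection.
>   static boolean shouldRechooseSinks(IOException ioe) {
>     return ioe instanceof ConnectException || ioe instanceof UnknownHostException;
>   }
>
>   public static void main(String[] args) {
>     System.out.println(shouldRechooseSinks(new ConnectException("refused")));       // already handled
>     System.out.println(shouldRechooseSinks(new UnknownHostException("stale-dns"))); // proposed
>   }
> }
> {code}
> With that check in place, the same {{replicationSinkMgr.chooseSinks()}} call that
> currently fires for {{ConnectException}} would also refresh the sink list when a
> cached host no longer resolves.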
> We observed this behavior on 1.2.0-cdh-5.11.1 but it appears the same code
> still exists on the current master branch.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)