[
https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024129#comment-16024129
]
Guanghao Zhang commented on HBASE-18111:
----------------------------------------
bq. Have you taken look at ZOOKEEPER-2785?
It may be one reason. HBaseInterClusterReplicationEndpoint is an HBase client
that writes entries to the peer cluster. Shouldn't we handle the closed
connection case no matter what caused it? Right now replication gets stuck in
the while loop, and you have to restart the RS so that another RS takes over
and replicates the log.
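
To illustrate what handling the closed connection could look like, here is a
minimal, self-contained sketch. It is not the attached patch and not HBase
code: PeerConnection, ConnectionFactory, replicate(), maxAttempts and sleepMs
are hypothetical names made up for the example. The only idea it shows is that
the retry loop checks whether the connection is closed and rebuilds it, instead
of sleeping and retrying on a connection that can never recover. The attempt
limit is there only so the sketch terminates.

```java
import java.io.IOException;

final class ReplicateRetrySketch {

  /** Hypothetical stand-in for the connection to the peer cluster. */
  interface PeerConnection {
    boolean isClosed();
    void ship() throws IOException;
  }

  /** Hypothetical factory used to rebuild a connection that was closed. */
  interface ConnectionFactory {
    PeerConnection create() throws IOException;
  }

  /**
   * Tries to ship one batch. Returns true on success, false if every attempt
   * failed or the thread was interrupted while sleeping between attempts.
   */
  static boolean replicate(PeerConnection initial, ConnectionFactory factory,
      int maxAttempts, long sleepMs) {
    PeerConnection conn = initial;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        if (conn.isClosed()) {
          // A closed connection stays closed, so retrying on it would spin
          // forever. Replace it before trying to ship again.
          conn = factory.create();
        }
        conn.ship();
        return true;
      } catch (IOException e) {
        // Shipping or reconnecting failed: back off, then try again.
        try {
          Thread.sleep(sleepMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
    }
    return false;
  }
}
```

With a check like this, the AuthFailed case from the log would cost one
reconnect rather than requiring an RS restart, assuming a new connection can
actually be established.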
> Replication stuck when cluster connection is closed
> ---------------------------------------------------
>
> Key: HBASE-18111
> URL: https://issues.apache.org/jira/browse/HBASE-18111
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
> Reporter: Guanghao Zhang
> Assignee: Guanghao Zhang
> Attachments: HBASE-18111.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)]
> org.apache.zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum
> member failed: javax.security.sasl.SaslException: An error:
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException:
> GSS initiate failed [Caused by GSSException: No valid credentials provided
> (Mechanism level: Connection reset)]) occurred when evaluating Zookeeper
> Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED
> state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread]
> org.apache.hadoop.hbase.client.HConnectionImplementation:
> hconnection-0x1148dd9b-0x35b6b4d4ca999c6,
> quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
> baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6
> received auth failed from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode =
> AuthFailed
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread]
> org.apache.hadoop.hbase.client.HConnectionImplementation: Closing zookeeper
> sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800]
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
> Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local
> exception: java.io.IOException: Connection closed
> {code}
> jstack
> {code}
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
> at
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZooKeeperWatcher received an
> AuthFailed event. After that, HBaseInterClusterReplicationEndpoint's
> replicate() method gets stuck in its while loop.