[ 
https://issues.apache.org/jira/browse/HBASE-20561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478323#comment-16478323
 ] 

Duo Zhang commented on HBASE-20561:
-----------------------------------

Checking for KeeperException in ReplicationSourceManager is a bit strange. We 
have a storage interface layer that hides the implementation details, which is 
why we use ReplicationException instead of KeeperException.
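
To illustrate the layering concern, here is a minimal sketch. All names in it 
(BackendException, StorageException, QueueStorage, FailingStorage, tryRemove) 
are hypothetical stand-ins, not HBase classes: BackendException plays the role 
of an implementation-specific error such as KeeperException, and 
StorageException plays the role of ReplicationException. The point is only 
that manager-level code can handle the failure without naming the backend type.

```java
public class StorageLayerDemo {
    /** Hypothetical stand-in for an implementation-specific error such as KeeperException. */
    static class BackendException extends Exception {
        BackendException(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for the storage-level ReplicationException. */
    static class StorageException extends Exception {
        StorageException(String msg, Throwable cause) { super(msg, cause); }
    }

    /** Storage interface: callers only ever see StorageException. */
    interface QueueStorage {
        void removeWal(String queueId, String fileName) throws StorageException;
    }

    /** An implementation whose backend always fails, for illustration only. */
    static class FailingStorage implements QueueStorage {
        @Override
        public void removeWal(String queueId, String fileName) throws StorageException {
            try {
                throw new BackendException("backend error");
            } catch (BackendException e) {
                // The backend error is kept as the cause, but not exposed in the signature.
                throw new StorageException("Failed to remove wal from queue " + queueId, e);
            }
        }
    }

    /** Manager-level code: handles the failure without referring to the backend type. */
    static String tryRemove(QueueStorage storage) {
        try {
            storage.removeWal("q1", "wal-1");
            return "ok";
        } catch (StorageException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRemove(new FailingStorage()));
    }
}
```

If the manager instead inspected the cause for the backend type, it would be 
coupled to one particular storage implementation again, which is what the 
interface layer is meant to avoid.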

Also, is it safe to rely on Thread.isInterrupted? I'm not sure whether the 
ZooKeeper implementation restores the interrupted flag...
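
For context on why this matters, here is a small plain-Java sketch with no 
ZooKeeper or HBase code involved. The class and method names are made up for 
the demonstration. When Object.wait() throws InterruptedException, the JVM 
clears the thread's interrupted status, so a later Thread.isInterrupted() 
check only returns true if whoever caught the exception called 
Thread.currentThread().interrupt() again. The ZKWatcher log message in the 
issue description says it "will interrupt current thread", which suggests the 
flag is restored on that path, but whether every path does so is not 
something this sketch can confirm.

```java
public class InterruptFlagDemo {
    /**
     * Calls Object.wait() with an interrupt already pending, catches the
     * resulting InterruptedException, and reports what isInterrupted()
     * returns afterwards.
     */
    static boolean flagAfterCatch(boolean restoreFlag) throws InterruptedException {
        final boolean[] observed = new boolean[1];
        Thread worker = new Thread(() -> {
            Object lock = new Object();
            Thread.currentThread().interrupt(); // set a pending interrupt
            try {
                synchronized (lock) {
                    lock.wait(); // throws at once and clears the interrupted status
                }
            } catch (InterruptedException e) {
                if (restoreFlag) {
                    // Restore the status so that callers further up can see it.
                    Thread.currentThread().interrupt();
                }
            }
            observed[0] = Thread.currentThread().isInterrupted();
        });
        worker.start();
        worker.join(); // join() makes the worker's write to observed[0] visible here
        return observed[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("flag without restore: " + flagAfterCatch(false)); // false
        System.out.println("flag with restore:    " + flagAfterCatch(true));  // true
    }
}
```

So a check based on Thread.isInterrupted is only reliable if every layer 
between the blocking call and the check restores the flag after catching 
InterruptedException.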

> The way we stop a ReplicationSource may cause the RS down
> ---------------------------------------------------------
>
>                 Key: HBASE-20561
>                 URL: https://issues.apache.org/jira/browse/HBASE-20561
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Duo Zhang
>            Assignee: Guanghao Zhang
>            Priority: Major
>         Attachments: HBASE-20561.master.001.patch
>
>
> See this:
> https://builds.apache.org/job/HBASE-Flaky-Tests/31125/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.multiwal.TestReplicationKillMasterRSCompressedWithMultipleAsyncWAL-output.txt
> {noformat}
> 2018-05-09 15:07:00,887 INFO  [RS_REFRESH_PEER-regionserver/asf916:0-1] 
> regionserver.RefreshPeerCallable(52): Received a peer change event, peerId=2, 
> type=REMOVE_PEER
> 2018-05-09 15:07:00,890 INFO  [RS_REFRESH_PEER-regionserver/asf916:0-1] 
> regionserver.ReplicationSource(485): Closing source 
> 2-asf916.gq1.ygridcore.net,36287,1525878368395 because: Replication stream 
> was removed by a user
> 2018-05-09 15:07:00,892 DEBUG 
> [ReplicationExecutor-0.replicationSource,2-asf916.gq1.ygridcore.net,36287,1525878368395.replicationSource.shipperasf916.gq1.ygridcore.net%2C36287%2C1525878368395.asf916.gq1.ygridcore.net%2C36287%2C1525878368395.regiongroup-0,2-asf916.gq1.ygridcore.net,36287,1525878368395]
>  zookeeper.ZKWatcher(617): regionserver:34308-0x163456ff2490004, 
> quorum=localhost:60149, baseZNode=/1 Received InterruptedException, will 
> interrupt current thread and rethrow a SystemErrorException
> java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method)
>       at java.lang.Object.wait(Object.java:502)
>       at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406)
>       at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:871)
>       at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:166)
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1231)
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1220)
>       at 
> org.apache.hadoop.hbase.replication.ZKReplicationQueueStorage.removeWAL(ZKReplicationQueueStorage.java:198)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.lambda$cleanOldLogs$8(ReplicationSourceManager.java:526)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.abortWhenFail(ReplicationSourceManager.java:454)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:526)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:506)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:489)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:231)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:133)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:103)
> 2018-05-09 15:07:00,892 DEBUG 
> [ReplicationExecutor-0.replicationSource,2-asf916.gq1.ygridcore.net,36287,1525878368395.replicationSource.shipperasf916.gq1.ygridcore.net%2C36287%2C1525878368395.asf916.gq1.ygridcore.net%2C36287%2C1525878368395.regiongroup-1,2-asf916.gq1.ygridcore.net,36287,1525878368395]
>  zookeeper.ZKWatcher(617): regionserver:34308-0x163456ff2490004, 
> quorum=localhost:60149, baseZNode=/1 Received InterruptedException, will 
> interrupt current thread and rethrow a SystemErrorException
> java.lang.InterruptedException
>       at java.lang.Object.wait(Native Method)
>       at java.lang.Object.wait(Object.java:502)
>       at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406)
>       at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:990)
>       at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
>       at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:663)
>       at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1690)
>       at 
> org.apache.hadoop.hbase.replication.ZKReplicationQueueStorage.setWALPosition(ZKReplicationQueueStorage.java:246)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.lambda$logPositionAndCleanOldLogs$7(ReplicationSourceManager.java:487)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.abortWhenFail(ReplicationSourceManager.java:454)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:487)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:231)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:133)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:103)
> 2018-05-09 15:07:00,896 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=4,queue=0,port=36902] 
> master.MasterRpcServices(1141): Checking to see if procedure is done pid=52
> 2018-05-09 15:07:00,898 ERROR 
> [ReplicationExecutor-0.replicationSource,2-asf916.gq1.ygridcore.net,36287,1525878368395.replicationSource.shipperasf916.gq1.ygridcore.net%2C36287%2C1525878368395.asf916.gq1.ygridcore.net%2C36287%2C1525878368395.regiongroup-0,2-asf916.gq1.ygridcore.net,36287,1525878368395]
>  helpers.MarkerIgnoringBase(159): ***** ABORTING region server 
> asf916.gq1.ygridcore.net,34308,1525878368287: Failed to operate on 
> replication queue *****
> org.apache.hadoop.hbase.replication.ReplicationException: Failed to remove 
> wal from queue (serverName=asf916.gq1.ygridcore.net,34308,1525878368287, 
> queueId=2-asf916.gq1.ygridcore.net,36287,1525878368395, 
> fileName=asf916.gq1.ygridcore.net%2C36287%2C1525878368395.asf916.gq1.ygridcore.net%2C36287%2C1525878368395.regiongroup-0.1525878371229)
>       at 
> org.apache.hadoop.hbase.replication.ZKReplicationQueueStorage.removeWAL(ZKReplicationQueueStorage.java:202)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.lambda$cleanOldLogs$8(ReplicationSourceManager.java:526)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.abortWhenFail(ReplicationSourceManager.java:454)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:526)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.cleanOldLogs(ReplicationSourceManager.java:506)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:489)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:231)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:133)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:103)
> Caused by: org.apache.zookeeper.KeeperException$SystemErrorException: 
> KeeperErrorCode = SystemError
>       at 
> org.apache.hadoop.hbase.zookeeper.ZKWatcher.interruptedException(ZKWatcher.java:608)
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1236)
>       at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1220)
>       at 
> org.apache.hadoop.hbase.replication.ZKReplicationQueueStorage.removeWAL(ZKReplicationQueueStorage.java:198)
>       ... 8 more
> {noformat}


