[ 
https://issues.apache.org/jira/browse/HBASE-23169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955751#comment-16955751
 ] 

Karthick commented on HBASE-23169:
----------------------------------

[~wchevreuil] We have 1.4.10 deployed on our production clusters. We checked 
the patch from [HBASE-22784|https://jira.apache.org/jira/browse/HBASE-22784] 
for conflicts against 1.4.10 and, finding none, applied it. Please note that 
the region server aborts happen randomly. We have restart mechanisms in place 
for now, but because of this issue we cannot roll the patch out to all our 
clusters.
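To illustrate the failure mode the stack trace below points at: the shipper thread updates the replication WAL position znode via ZKUtil.setData, while the old-WALs cleanup path can delete that same znode, so the next position update hits KeeperException$NoNodeException and aborts the server. The following is a toy Java sketch of that race, not HBase code; the map stands in for the znode tree and IllegalStateException stands in for NoNodeException (all names here are illustrative only):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for the znodes under /hbase/replication/rs/<server>/<queue>/.
// Hypothetical names; this only models the ordering problem, not HBase itself.
public class WalPositionRace {
    static final Map<String, Long> znodes = new ConcurrentHashMap<>();

    // Analogue of setting data on a znode: ZooKeeper's setData fails with
    // NoNodeException when the node no longer exists, which we model here
    // by throwing when the key is absent.
    static void setLogPosition(String wal, long position) {
        if (znodes.replace(wal, position) == null) {
            throw new IllegalStateException("NoNode for " + wal);
        }
    }

    public static void main(String[] args) {
        String wal = "172.20.20.20%2C16020%2C1570193969775.1570288637045";
        znodes.put(wal, 0L);

        // Shipper thread records progress: fine while the znode exists.
        setLogPosition(wal, 127494739L);

        // Old-WALs cleanup removes the znode for the (supposedly old) WAL...
        znodes.remove(wal);

        // ...so the shipper's next position update fails, which in the real
        // code path is treated as fatal and aborts the region server.
        try {
            setLogPosition(wal, 127494740L);
            System.out.println("update succeeded");
        } catch (IllegalStateException e) {
            System.out.println("ABORT: " + e.getMessage());
        }
    }
}
```

The sketch only shows why the ordering matters: once the cleanup path can delete the znode while the shipper still intends to write to it, the NoNode failure is a matter of timing, which would explain why the aborts appear random.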

> Random region server aborts while clearing Old Wals
> ---------------------------------------------------
>
>                 Key: HBASE-23169
>                 URL: https://issues.apache.org/jira/browse/HBASE-23169
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, Replication, wal
>    Affects Versions: 1.4.10, 1.4.11
>            Reporter: Karthick
>            Assignee: Wellington Chevreuil
>            Priority: Blocker
>              Labels: patch
>
> After applying the patch given in 
> [HBASE-22784|https://jira.apache.org/jira/browse/HBASE-22784], random region 
> server aborts were noticed. This happens in the ReplicationSourceShipperThread 
> while writing the replication WAL position.
> {code:java}
> 2019-10-05 08:17:28,132 FATAL [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2] regionserver.HRegionServer: ABORTING region server 172.20.20.20,16020,1570193969775: Failed to write replication wal position (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, position=127494739)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/172.20.20.20,16020,1570193969775/2/172.20.20.20%2C16020%2C1570193969775.1570288637045
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1327)
> 	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:422)
> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:824)
> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:874)
> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:868)
> 	at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:155)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:194)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:727)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:698)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:551)
> 2019-10-05 08:17:28,133 FATAL [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)