[https://issues.apache.org/jira/browse/HBASE-23169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955751#comment-16955751]
Karthick commented on HBASE-23169:
----------------------------------
[~wchevreuil] We have 1.4.10 deployed on our production clusters. We checked
1.4.10 for conflicts with the patch in
[HBASE-22784|https://jira.apache.org/jira/browse/HBASE-22784], and since there
were none we applied it. Please note that the region server aborts happen
randomly. We have restart mechanisms in place, but because of this issue we
are not able to apply the patch in all our clusters.
> Random region server aborts while clearing Old Wals
> ---------------------------------------------------
>
> Key: HBASE-23169
> URL: https://issues.apache.org/jira/browse/HBASE-23169
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Replication, wal
> Affects Versions: 1.4.10, 1.4.11
> Reporter: Karthick
> Assignee: Wellington Chevreuil
> Priority: Blocker
> Labels: patch
>
> After applying the patch given in
> [HBASE-22784|https://jira.apache.org/jira/browse/HBASE-22784], random region
> server aborts were noticed. This happens in the ReplicationSourceShipperThread
> while writing the replication WAL position to ZooKeeper.
> {code:java}
> 2019-10-05 08:17:28,132 FATAL [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2] regionserver.HRegionServer: ABORTING region server 172.20.20.20,16020,1570193969775: Failed to write replication wal position (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, position=127494739)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/172.20.20.20,16020,1570193969775/2/172.20.20.20%2C16020%2C1570193969775.1570288637045
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1327)
> 	at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:422)
> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:824)
> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:874)
> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:868)
> 	at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:155)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:194)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:727)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:698)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:551)
> 2019-10-05 08:17:28,133 FATAL [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint
> {code}
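The stack trace above shows the failure mode: the shipper thread calls setData on a WAL-position znode that no longer exists, ZooKeeper throws NoNodeException, and the error is treated as fatal, aborting the whole RegionServer. The sketch below is a minimal, hypothetical Java simulation of that interaction (a HashMap stands in for ZooKeeper; the class and method names are illustrative, not HBase's actual code or fix). It contrasts strict setData semantics with a tolerant update that treats a missing znode as a skippable condition rather than an abort:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: simulates ZooKeeper znodes with a HashMap to show
// why a plain setData on a removed WAL-position znode is fatal, and what
// a tolerant update would look like instead.
public class WalPositionStore {
    private final Map<String, Long> znodes = new HashMap<>();

    public void createNode(String path, long position) {
        znodes.put(path, position);
    }

    // e.g. the queue entry was deleted concurrently by log cleanup
    public void removeNode(String path) {
        znodes.remove(path);
    }

    // Mirrors plain setData semantics: throws when the node is gone.
    // In the report, this exception propagates and aborts the RegionServer.
    public void setDataStrict(String path, long position) {
        if (!znodes.containsKey(path)) {
            throw new IllegalStateException("NoNode for " + path);
        }
        znodes.put(path, position);
    }

    // Tolerant variant: returns false instead of failing when the znode
    // has already been removed, so the caller can skip rather than abort.
    public boolean tryUpdateLogPosition(String path, long position) {
        try {
            setDataStrict(path, position);
            return true;
        } catch (IllegalStateException noNode) {
            return false;
        }
    }

    public Long get(String path) {
        return znodes.get(path);
    }

    public static void main(String[] args) {
        WalPositionStore store = new WalPositionStore();
        String wal = "/hbase/replication/rs/server,16020,1/2/wal.1570288637045";
        store.createNode(wal, 0L);
        System.out.println(store.tryUpdateLogPosition(wal, 127494739L)); // true
        store.removeNode(wal);
        System.out.println(store.tryUpdateLogPosition(wal, 127494740L)); // false
    }
}
```

Whether skipping is actually safe here depends on why the znode vanished, which is exactly what this issue is investigating; the sketch only illustrates the mechanics of the crash.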
--
This message was sent by Atlassian Jira
(v8.3.4#803005)