[
https://issues.apache.org/jira/browse/HBASE-23169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950787#comment-16950787
]
Wellington Chevreuil commented on HBASE-23169:
----------------------------------------------
Any chance we could get an RS log with TRACE enabled uploaded, so that we
have more context on what was happening prior to the FATAL message?
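
For reference, a minimal sketch of how that could be enabled, assuming the
stock conf/log4j.properties (log4j 1.2) is in use on the affected region
server. The two logger names below are only the packages that appear in the
stack trace, not a confirmed list of what is needed:
{code}
# Assumption: added to conf/log4j.properties on the affected RS.
# Covers ReplicationQueuesZKImpl and the replication.regionserver classes.
log4j.logger.org.apache.hadoop.hbase.replication=TRACE
# Covers ZKUtil and RecoverableZooKeeper; any more specific logger entries
# already present in the file for these classes would take precedence.
log4j.logger.org.apache.hadoop.hbase.zookeeper=TRACE
{code}
The appender threshold may also need to allow TRACE for the extra lines to
reach the log file.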
> Random region server aborts while clearing Old Wals
> ---------------------------------------------------
>
> Key: HBASE-23169
> URL: https://issues.apache.org/jira/browse/HBASE-23169
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Replication, wal
> Affects Versions: 1.4.10, 1.4.11
> Reporter: Karthick
> Assignee: Wellington Chevreuil
> Priority: Blocker
> Labels: patch
>
> After applying the patch given in
> [HBASE-22784|https://jira.apache.org/jira/browse/HBASE-22784], random region
> server aborts were noticed. They happen in the ReplicationSourceShipperThread
> while it writes the replication WAL position.
> {code:java}
> 2019-10-05 08:17:28,132 FATAL [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2] regionserver.HRegionServer: ABORTING region server 172.20.20.20,16020,1570193969775: Failed to write replication wal position (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, position=127494739)
> 2019-10-05 08:17:28,132 FATAL [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2] regionserver.HRegionServer: ABORTING region server 172.20.20.20,16020,1570193969775: Failed to write replication wal position (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, position=127494739)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/172.20.20.20,16020,1570193969775/2/172.20.20.20%2C16020%2C1570193969775.1570288637045
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1327)
>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:422)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:824)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:874)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:868)
>   at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:155)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:194)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:727)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:698)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:551)
> 2019-10-05 08:17:28,133 FATAL [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)