[jira] [Commented] (HBASE-23169) Random region server aborts while clearing Old Wals

Wellington Chevreuil (Jira) Fri, 18 Oct 2019 03:55:46 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-23169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954492#comment-16954492
 ]


Wellington Chevreuil commented on HBASE-23169:
----------------------------------------------

Hi [~KarthickRam], I noticed you marked this is affecting 1.4.11, but I 
couldn't reproduce this issue in my previous tests:
{noformat}
Tried active-active replication, set a table for replication on both, verified 
that replication is working both ways, then ran ltt for a new, not targeted to 
replication table, then verified oldWALs are getting deleted and log position 
reported by DumpReplicationQueues is updated even when we have only edits not 
targeted for replication. {noformat}

You mentioned you had applied HBASE-22784 patch on top of 1.4.10. Did it 
applied cleanly, or were there some conflicts? Have you also managed to 
reproduce this on a 1.4.11 deployment?

> Random region server aborts while clearing Old Wals
> ---------------------------------------------------
>
>                 Key: HBASE-23169
>                 URL: https://issues.apache.org/jira/browse/HBASE-23169
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, Replication, wal
>    Affects Versions: 1.4.10, 1.4.11
>            Reporter: Karthick
>            Assignee: Wellington Chevreuil
>            Priority: Blocker
>              Labels: patch
>
> After applying the patch given in 
> [HBASE-22784|https://jira.apache.org/jira/browse/HBASE-22784] random region 
> server aborts were noticed. This happens in ReplicationResourceShipper thread 
> while writing the replication wal position.
> {code:java}
> 2019-10-05 08:17:28,132 FATAL 
> [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
>  regionserver.HRegionServer: ABORTING region server 
> 172.20.20.20,16020,1570193969775: Failed to write replication wal position 
> (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, 
> position=127494739)2019-10-05 08:17:28,132 FATAL 
> [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
>  regionserver.HRegionServer: ABORTING region server 
> 172.20.20.20,16020,1570193969775: Failed to write replication wal position 
> (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, 
> position=127494739)org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for 
> /hbase/replication/rs/172.20.20.20,16020,1570193969775/2/172.20.20.20%2C16020%2C1570193969775.1570288637045
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at 
> org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1327) at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:422)
>  at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:824) at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:874) at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:868) at 
> org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:155)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:194)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:727)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:698)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:551)2019-10-05
>  08:17:28,133 FATAL 
> [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
>  regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: 
> [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HBASE-23169) Random region server aborts while clearing Old Wals

Reply via email to