[ 
https://issues.apache.org/jira/browse/HBASE-23169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957864#comment-16957864
 ] 

Jeongdae Kim commented on HBASE-23169:
--------------------------------------

[~KarthickRam]  Your logs say 

0) EOF reached in reader thread
2019-10-15 20:32:02,095 TRACE 
[regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2]
 regionserver.WALEntryStream: Reached the end of log 
hdfs://OtherGridMetaCluster/hbasedata/WALs/172.72.72.72,16020,1570500915276/172.72.72.72%2C16020%2C1570500915276.1571196349507,
 and the length of the file is 127513914
 

1) znode "...1571196349507" removed in replication queue.
2019-10-15 20:32:02,101 DEBUG 
[regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2]
 regionserver.ReplicationSourceManager: Removing 1 logs in the list: 
[172.72.72.72%2C16020%2C1570500915276.1571196349507]
 

2) tried to update log position to the znode already removed by 1).
2019-10-15 20:32:02,198 FATAL 
[regionserver//172.72.72.72:16020.replicationSource.172.72.72.72%2C16020%2C1570500915276,2]
 regionserver.HRegionServer: ABORTING region server 
172.72.72.72,16020,1570500915276: Failed to write replication wal position 
(filename=172.72.72.72%2C16020%2C1570500915276.1571196349507, 
position=63690)org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for 
/hbase/replication/rs/172.72.72.72,16020,1570500915276/2/172.72.72.72%2C16020%2C1570500915276.1571196349507
 

And my guess is:

The WAL was rolled before 1); the entries then appended to the new WAL were not eligible for replication (all filtered out), so the reader advanced the log position to its current read position and removed the old WALs from the replication queue, as seen in 1).

Then a new entry that did need replicating was appended, after which no entries arrived for a while.
2019-10-15 20:32:02,130 TRACE 
[regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2]
 regionserver.ReplicationSourceWALReaderThread: Read 1 WAL entries eligible for 
replication
 
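To make the guessed sequence concrete, here is a minimal, purely illustrative sketch (not HBase code; the method and the Map standing in for the ZooKeeper queue znodes are hypothetical): when every entry in the current WAL is filtered out, the reader still records its read position on that WAL and drops all older WALs from the queue, which is step 1) above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FilteredAdvanceSketch {
    /**
     * Record the read position on the current WAL and drop older queue
     * entries, mimicking the reader's behavior when all entries are filtered.
     * WAL names embed a creation timestamp, so lexicographic order matches
     * age in this toy model.
     */
    static void advancePastFiltered(Map<String, Long> queue, String currentWal, long pos) {
        queue.put(currentWal, pos);
        queue.keySet().removeIf(wal -> wal.compareTo(currentWal) < 0);
    }

    public static void main(String[] args) {
        // Replication queue: WAL name -> last recorded position (znodes in ZK).
        Map<String, Long> queue = new LinkedHashMap<>();
        queue.put("wal.1571196349507", 63690L); // old, already-rolled WAL
        queue.put("wal.1571196400000", 0L);     // current WAL after the roll

        advancePastFiltered(queue, "wal.1571196400000", 128L);
        // The old WAL's queue entry (and its znode, in the real system) is gone.
        System.out.println(queue.keySet());
    }
}
```

After this point, any in-flight batch that still references the old WAL name has nothing left to update.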

So a batch was sent to the shipper, which tries to update the log position after replicating an entry.

But the batch's `lastWalPath` may still be the previous, already-rolled WAL path, because `lastWalPath` points to the WAL that was current when the batch was created (the batch might have been created before the WAL rolled).
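A minimal sketch of that race, under stated assumptions (the `Batch` class, `setLogPosition` stand-in, and the Map playing the role of the `/hbase/replication/rs/...` znodes are all hypothetical simplifications; the real call is `ReplicationQueuesZKImpl.setLogPosition`, which throws `KeeperException$NoNodeException`):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class StaleBatchSketch {
    /** A batch snapshots lastWalPath at creation time and never refreshes it. */
    static class Batch {
        final String lastWalPath;
        final long endPosition;
        Batch(String wal, long pos) { lastWalPath = wal; endPosition = pos; }
    }

    /** Stand-in for setLogPosition(): fails if the WAL's znode no longer exists. */
    static void setLogPosition(Map<String, Long> znodes, Batch b) {
        if (!znodes.containsKey(b.lastWalPath)) {
            throw new NoSuchElementException("NoNode for " + b.lastWalPath);
        }
        znodes.put(b.lastWalPath, b.endPosition);
    }

    public static void main(String[] args) {
        Map<String, Long> znodes = new HashMap<>();
        znodes.put("wal.1571196349507", 0L);

        Batch batch = new Batch("wal.1571196349507", 63690L); // created before the roll
        znodes.remove("wal.1571196349507");                   // reader cleaned old WALs (step 1)

        try {
            setLogPosition(znodes, batch);                    // shipper runs later (step 2)
        } catch (NoSuchElementException e) {
            // In the real region server this surfaces as the FATAL abort above.
            System.out.println("would abort: " + e.getMessage());
        }
    }
}
```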

 

I described this problem in detail in HBASE-23205, and we have seen some other 
replication issues on branch-1 as well.

> Random region server aborts while clearing Old Wals
> ---------------------------------------------------
>
>                 Key: HBASE-23169
>                 URL: https://issues.apache.org/jira/browse/HBASE-23169
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, Replication, wal
>    Affects Versions: 1.4.10, 1.4.11
>            Reporter: Karthick
>            Assignee: Wellington Chevreuil
>            Priority: Blocker
>              Labels: patch
>
> After applying the patch given in 
> [HBASE-22784|https://jira.apache.org/jira/browse/HBASE-22784], random region 
> server aborts were noticed. This happens in the ReplicationSourceShipper 
> thread while writing the replication WAL position.
> {code:java}
> 2019-10-05 08:17:28,132 FATAL 
> [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
>  regionserver.HRegionServer: ABORTING region server 
> 172.20.20.20,16020,1570193969775: Failed to write replication wal position 
> (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, 
> position=127494739)2019-10-05 08:17:28,132 FATAL 
> [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
>  regionserver.HRegionServer: ABORTING region server 
> 172.20.20.20,16020,1570193969775: Failed to write replication wal position 
> (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, 
> position=127494739)org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for 
> /hbase/replication/rs/172.20.20.20,16020,1570193969775/2/172.20.20.20%2C16020%2C1570193969775.1570288637045
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at 
> org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1327) at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:422)
>  at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:824) at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:874) at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:868) at 
> org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:155)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:194)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:727)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:698)
>  at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:551)2019-10-05
>  08:17:28,133 FATAL 
> [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
>  regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: 
> [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
