[
https://issues.apache.org/jira/browse/HBASE-23169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957864#comment-16957864
]
Jeongdae Kim commented on HBASE-23169:
--------------------------------------
[~KarthickRam] Your logs say
0) EOF reached in reader thread
2019-10-15 20:32:02,095 TRACE [regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2] regionserver.WALEntryStream: Reached the end of log hdfs://OtherGridMetaCluster/hbasedata/WALs/172.72.72.72,16020,1570500915276/172.72.72.72%2C16020%2C1570500915276.1571196349507, and the length of the file is 127513914
1) the znode "...1571196349507" was removed from the replication queue
2019-10-15 20:32:02,101 DEBUG [regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2] regionserver.ReplicationSourceManager: Removing 1 logs in the list: [172.72.72.72%2C16020%2C1570500915276.1571196349507]
2) the shipper tried to update the log position on the znode already removed in 1)
2019-10-15 20:32:02,198 FATAL [regionserver//172.72.72.72:16020.replicationSource.172.72.72.72%2C16020%2C1570500915276,2] regionserver.HRegionServer: ABORTING region server 172.72.72.72,16020,1570500915276: Failed to write replication wal position (filename=172.72.72.72%2C16020%2C1570500915276.1571196349507, position=63690)
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/172.72.72.72,16020,1570500915276/2/172.72.72.72%2C16020%2C1570500915276.1571196349507
My guess is: the WAL was rolled before 1), and the entries then appended to it did not need replication (they were filtered out), so the reader advanced the log position to its current read position and removed the old WALs from the replication queue (step 1). After that, an entry that did need replication was appended, and then no entries arrived for a while.
2019-10-15 20:32:02,130 TRACE [regionserver//172.72.72.72:16020.replicationSource.replicationWALReaderThread.172.72.72.72%2C16020%2C1570500915276,2] regionserver.ReplicationSourceWALReaderThread: Read 1 WAL entries eligible for replication
So a batch was sent to the shipper, which tries to update the log position after replicating an entry. But the batch's `lastWalPath` may still be the previous, already-rolled WAL path, because `lastWalPath` points at the WAL that was current when the batch was created, and the batch may have been created before the WAL rolled.
I described this problem in detail in HBASE-23205; we have several replication issues like this on branch-1.
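The race above can be sketched outside HBase with a small, hypothetical model: the per-WAL replication-queue znodes become a map, the "reader" deletes the old WAL's entry after the roll, and the "shipper" then updates the position through a `lastWalPath` captured before the roll. Every name here (`queueZnodes`, the `IllegalStateException` standing in for `NoNodeException`) is an illustrative stand-in, not HBase API:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the race described above; NOT HBase code.
public class StaleWalPositionRace {
    // Stands in for the /hbase/replication/rs/<rs>/<peer>/<wal> znodes.
    public static final ConcurrentHashMap<String, Long> queueZnodes = new ConcurrentHashMap<>();

    // Mimics the setLogPosition failure mode: updating a missing node throws
    // (the analogue of KeeperException$NoNodeException, which aborts the RS).
    public static void setLogPosition(String wal, long pos) {
        if (queueZnodes.replace(wal, pos) == null) {
            throw new IllegalStateException("NoNode for " + wal);
        }
    }

    public static void main(String[] args) {
        queueZnodes.put("wal.1571196349507", 0L);
        // Shipper's batch is created now and pins the current WAL path.
        String lastWalPath = "wal.1571196349507";

        // WAL rolls; the reader reaches EOF on the old file, moves on to the
        // new one, and removes the old WAL from the queue (step 1 in the logs).
        queueZnodes.put("wal.1571196400000", 0L);
        queueZnodes.remove("wal.1571196349507");

        // Shipper finishes the batch and updates the position through the
        // stale lastWalPath (step 2 in the logs) -> node is already gone.
        try {
            setLogPosition(lastWalPath, 63690L);
        } catch (IllegalStateException e) {
            System.out.println("abort: " + e.getMessage());
        }
    }
}
```

In the real branch-1 code path the analogous failure is the `KeeperException$NoNodeException` out of `ReplicationQueuesZKImpl.setLogPosition` shown in the stack trace in this issue, and it is treated as fatal.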
> Random region server aborts while clearing Old Wals
> ---------------------------------------------------
>
> Key: HBASE-23169
> URL: https://issues.apache.org/jira/browse/HBASE-23169
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Replication, wal
> Affects Versions: 1.4.10, 1.4.11
> Reporter: Karthick
> Assignee: Wellington Chevreuil
> Priority: Blocker
> Labels: patch
>
> After applying the patch given in
> [HBASE-22784|https://jira.apache.org/jira/browse/HBASE-22784], random region
> server aborts were noticed. This happens in the ReplicationSourceShipperThread
> while writing the replication WAL position.
> {code:java}
> 2019-10-05 08:17:28,132 FATAL [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2] regionserver.HRegionServer: ABORTING region server 172.20.20.20,16020,1570193969775: Failed to write replication wal position (filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, position=127494739)
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/172.20.20.20,16020,1570193969775/2/172.20.20.20%2C16020%2C1570193969775.1570288637045
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1327)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:422)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:824)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:874)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:868)
>     at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:155)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:194)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:727)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:698)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:551)
> 2019-10-05 08:17:28,133 FATAL [regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)