Zheng Wang created HBASE-23008:
----------------------------------
Summary: ReplicationSourceShipper has no chance to delete hlog
znode when the wal entry batch is always empty
Key: HBASE-23008
URL: https://issues.apache.org/jira/browse/HBASE-23008
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.0.0
Reporter: Zheng Wang
My live cluster is configured for master-master replication, and only one
cluster, the active one, receives puts.
Recently I found a great many znodes under
/hbase/replication/rs/#host/#peer on the backup cluster, at least 10,000.
I think the reason is that the WAL entries on the backup cluster are all
filtered out by ClusterMarkingEntryFilter, so ReplicationSourceWALReader never
puts anything into entryBatchQueue, and ReplicationSourceShipper stays blocked
at entryReader.take(); it never gets a chance to delete the hlog znodes.
The thread stacks of walReader and walShipper are below:
{code:java}
"main-EventThread.replicationSource,2.replicationSource.bj1-203-centos17%2C16020%2C1567586932902.bj1-203-centos17%2C16020%2C1567586932902.regiongroup-0,2.replicationSource.wal-reader.bj1-203-centos17%2C16020%2C1567586932902.bj1-203-centos17%2C16020%2C1567586932902.regiongroup-0,2" #157238 daemon prio=5 os_prio=0 tid=0x00007f7634be8800 nid=0x377ef waiting on condition [0x00007f6114c0e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.handleEmptyWALEntryBatch(ReplicationSourceWALReader.java:192)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:142)

"main-EventThread.replicationSource,2.replicationSource.bj1-203-centos17%2C16020%2C1567586932902.bj1-203-centos17%2C16020%2C1567586932902.regiongroup-0,2" #157237 daemon prio=5 os_prio=0 tid=0x00007f76350b0000 nid=0x377ee waiting on condition [0x00007f6108173000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007f6f99bb6718> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.take(ReplicationSourceWALReader.java:248)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:108)
{code}
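To make the blocking behavior concrete, here is a minimal standalone sketch. It is not HBase code and not a proposed patch: the class ShipperSketch, the method shipWithTimeout and the cleanup counter are hypothetical names used only for illustration. It shows that a consumer calling poll() with a timeout regains control when the queue stays empty, whereas take() on the same queue would park indefinitely, as in the second stack above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a shipper loop; illustration only, not HBase code. */
class ShipperSketch {

  /**
   * Polls the queue with a timeout instead of calling take().
   * Returns how many times the loop got a chance to do housekeeping
   * because no batch arrived within the timeout.
   */
  static int shipWithTimeout(BlockingQueue<String> batchQueue, int rounds, long timeoutMs)
      throws InterruptedException {
    int cleanups = 0;
    for (int i = 0; i < rounds; i++) {
      // take() would park here forever if the producer never enqueues anything.
      String batch = batchQueue.poll(timeoutMs, TimeUnit.MILLISECONDS);
      if (batch == null) {
        // Timed out with nothing to ship: the loop still regains control.
        // In the scenario from this issue, this is where stale WAL znodes
        // could be considered for removal (assumption, not the actual fix).
        cleanups++;
      }
      // A non-null batch would be shipped to the peer here.
    }
    return cleanups;
  }

  public static void main(String[] args) throws InterruptedException {
    // The queue stays empty, as when every WAL entry is filtered out.
    BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    int cleanups = shipWithTimeout(queue, 3, 50L);
    System.out.println("cleanup opportunities: " + cleanups);
  }
}
```

Whether a timed poll, or having the reader hand an empty batch to the shipper, fits the real code better is a design question for the actual patch; the sketch only illustrates why an untimed take() leaves no opportunity for cleanup.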
--
This message was sent by Atlassian Jira
(v8.3.2#803003)