[
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899460#comment-16899460
]
Solvannan R M edited comment on HBASE-22784 at 8/3/19 1:07 PM:
---------------------------------------------------------------
Hi [~wchevreuil],
Thanks for the reply!
We have managed to extract the following information relating to the issue:
*HBase DFS Usage Report:*
We can see that the oldWALs size is very high compared to the actual data size:
{code:java}
[email protected]~>./hadoop-2.7.3/bin/hdfs dfs -du -h /hbasedata
0 /hbasedata/.tmp
0 /hbasedata/xxxxxx
0 /hbasedata/MasterProcWALs
50.3 G /hbasedata/WALs
0 /hbasedata/archive
0 /hbasedata/corrupt
561.4 G /hbasedata/data
0 /hbasedata/hbase
42 /hbasedata/hbase.id
7 /hbasedata/hbase.version
405.9 G /hbasedata/oldWALs
0 /hbasedata/xxxx
0 /hbasedata/xxxx
0 /hbasedata/xxxxx{code}
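For completeness, the same usage report can be collected programmatically; the following is a minimal sketch (not taken from the cluster above) using the Hadoop FileSystem API, assuming the /hbasedata root shown in the output:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HBaseDfsUsage {
    public static void main(String[] args) throws Exception {
        // Uses the client's default configuration (fs.defaultFS etc. from core-site.xml).
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Equivalent of "hdfs dfs -du /hbasedata": one line per child with total bytes.
            for (FileStatus status : fs.listStatus(new Path("/hbasedata"))) {
                long bytes = fs.getContentSummary(status.getPath()).getLength();
                System.out.printf("%15d  %s%n", bytes, status.getPath());
            }
        }
    }
}
{code}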
*Zookeeper myQueuesZnodes:*
The replication queue entries have not been cleared from ZooKeeper for a long time:
{code:java}
[zk: 10.216.xxx.xxx:2191,10.216.xxx.xxx:2191,10.216.xxx.xxx:2191(CONNECTED) 3] get /hbase/replication/rs/10.216.xxx.xxx,16020,1554361253037/3/10.216.xxx.xxx%2C16020%2C1554361253037.1564
Display all 179 possibilities? (y or n)
10.216.xxx.xxx%2C16020%2C1554361253037.1554638617629
10.216.xxx.xxx%2C16020%2C1554361253037.1554485372383
....
{code}
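For reference, the number of queued WALs per peer can also be counted with a small client. Below is a minimal sketch using the plain ZooKeeper API, assuming the znode layout shown above (/hbase/replication/rs/<server>/<peerId>/<wal>); the quorum string and timeout are placeholders:
{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ReplicationQueueSizes {
    public static void main(String[] args) throws Exception {
        // Placeholder quorum string; use the real ensemble addresses and port.
        ZooKeeper zk = new ZooKeeper("10.216.xxx.xxx:2191", 30000, event -> { });
        try {
            String rsRoot = "/hbase/replication/rs";
            // Walk /hbase/replication/rs/<server>/<peerId> and count the queued WAL znodes.
            for (String server : zk.getChildren(rsRoot, false)) {
                for (String peer : zk.getChildren(rsRoot + "/" + server, false)) {
                    List<String> wals = zk.getChildren(rsRoot + "/" + server + "/" + peer, false);
                    System.out.println(server + " peer=" + peer + " queuedWALs=" + wals.size());
                }
            }
        } finally {
            zk.close();
        }
    }
}
{code}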
*Status 'replication' output of the regionserver:*
Note that for peer 3 the timestamp of the last shipped op is still the epoch, so the reported replication lag is effectively the current time in milliseconds:
{code:java}
SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Wed May 01 07:40:55 IST 2019, Replication Lag=8158202958
        PeerID=3, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Thu Jan 01 05:30:00 IST 1970, Replication Lag=1564834858373
SINK:   AgeOfLastAppliedOp=378, TimeStampsOfLastAppliedOp=Sat Aug 03 17:50:44 IST 2019
{code}
*Stacktrace of the replicationSource thread:*
We can see that the thread has been blocked in entryBatchQueue.take() for a long time:
{code:java}
"main-EventThread.replicationSource,3.replicationSource.10.216.xxx.xxx%2C16020%2C1554360804184,3"
#10121291 daemon prio=5 os_prio=0 tid=0x00007f00e14ea000 nid=0x6d4c0 waiting
on condition [0x00007ef7598c9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007ef8889a1b98> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.take(ReplicationSourceWALReaderThread.java:227)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:550)
{code}
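To make the state in the trace concrete, here is a self-contained sketch (not the actual HBase code) of the same pattern: a shipper thread parked in LinkedBlockingQueue.take() because the reader filters out every entry and never enqueues a batch:
{code:java}
import java.util.concurrent.LinkedBlockingQueue;

public class StuckShipperSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the entryBatchQueue shared by the WAL reader and the shipper.
        LinkedBlockingQueue<String> entryBatchQueue = new LinkedBlockingQueue<>(1);

        // Stand-in for the reader: every entry it reads is discarded by the filter
        // (as ClusterMarkingEntryFilter does for edits that came from the peer),
        // so nothing is ever offered to the queue.
        Thread reader = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(100); // read entry, filter returns null, enqueue nothing
                }
            } catch (InterruptedException ignored) {
                // exit on interrupt
            }
        }, "reader");
        reader.setDaemon(true);
        reader.start();

        // Stand-in for the shipper: like ReplicationSourceShipperThread, it parks here
        // in take() (thread state WAITING), so the code that would advance the WAL
        // position is never reached and the queued WAL is never released.
        String batch = entryBatchQueue.take();
        System.out.println("never reached: " + batch);
    }
}
{code}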
Kindly let us know if you need any specific logs for your analysis.
> OldWALs not cleared in a replication slave cluster (cyclic replication bw 2
> clusters)
> -------------------------------------------------------------------------------------
>
> Key: HBASE-22784
> URL: https://issues.apache.org/jira/browse/HBASE-22784
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Replication
> Affects Versions: 1.4.9, 1.4.10
> Reporter: Solvannan R M
> Priority: Major
>
> When a cluster is passive (receiving edits only via replication) in a cyclic
> replication setup of 2 clusters, the OldWALs size keeps on growing. On analysing
> this, we observed the following behaviour (sketched in code after this description):
> # A new entry is added to the WAL (an edit replicated from the other cluster).
> # ReplicationSourceWALReaderThread (RSWALRT) reads the entry and applies the
> configured filters (due to the cyclic replication setup, ClusterMarkingEntryFilter
> discards the entry coming from the other cluster).
> # The filtered entry is null, so RSWALRT neither updates the batch stats
> (WALEntryBatch.lastWalPosition) nor puts the batch in the entryBatchQueue.
> # The ReplicationSource thread stays blocked in entryBatchQueue.take().
> # So ReplicationSource#updateLogPosition is never invoked and the WAL file is
> never cleared from the ReplicationQueue.
> # Hence the LogCleaner on the master doesn't delete the oldWAL files from HDFS.
> NOTE: When a new edit is added via hbase-client, the ReplicationSource thread
> processes it, clears the oldWAL files from the replication queues, and hence the
> master cleans up the WALs.
> Please provide us with a solution.
>
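For illustration only, here is a simplified pseudo-implementation of the reader loop described in steps 1-5 above. The names WALEntryBatch, entryBatchQueue and lastWalPosition follow the description; the stand-in types and helper methods are hypothetical, not the actual HBase source:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ReaderLoopSketch {
    // Minimal stand-ins for the types named in the description above.
    static class Entry {
        final boolean replicatedFromPeer;
        Entry(boolean replicatedFromPeer) { this.replicatedFromPeer = replicatedFromPeer; }
    }
    static class WALEntryBatch {
        final List<Entry> entries = new ArrayList<>();
        long lastWalPosition; // only advanced when a batch is actually built
    }

    static final BlockingQueue<WALEntryBatch> entryBatchQueue = new LinkedBlockingQueue<>();

    // Stand-in for ClusterMarkingEntryFilter: drops edits that originated on the peer.
    static Entry filter(Entry e) {
        return e.replicatedFromPeer ? null : e;
    }

    // One iteration of the reader loop as described: if every entry is filtered out,
    // no batch is enqueued, so the shipper stays parked in take() and never calls
    // updateLogPosition for this WAL.
    static void readerIteration(Entry rawEntry, long walPositionAfterRead) throws InterruptedException {
        Entry filtered = filter(rawEntry);
        if (filtered != null) {
            WALEntryBatch batch = new WALEntryBatch();
            batch.entries.add(filtered);
            batch.lastWalPosition = walPositionAfterRead;
            entryBatchQueue.put(batch); // shipper wakes up, ships, advances the log position
        }
        // else: the position is not recorded anywhere, the WAL stays in the replication
        // queue znode, and the master's LogCleaner keeps the file in oldWALs.
    }

    public static void main(String[] args) throws Exception {
        readerIteration(new Entry(true), 42L); // a replicated edit: dropped, nothing queued
        System.out.println("batches queued: " + entryBatchQueue.size()); // prints 0
    }
}
{code}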