[
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899460#comment-16899460
]
Solvannan R M edited comment on HBASE-22784 at 8/3/19 1:07 PM:
---------------------------------------------------------------
Hi [~wchevreuil],
Thanks for the reply!
We have managed to extract the following information relating to the issue:
*HBase DFS Usage Report:*
We can see that the oldWALs size is very high compared to the actual data size:
{code:java}
[email protected]~>./hadoop-2.7.3/bin/hdfs dfs -du -h /hbasedata
0 /hbasedata/.tmp
0 /hbasedata/xxxxxx
0 /hbasedata/MasterProcWALs
50.3 G /hbasedata/WALs
0 /hbasedata/archive
0 /hbasedata/corrupt
561.4 G /hbasedata/data
0 /hbasedata/hbase
42 /hbasedata/hbase.id
7 /hbasedata/hbase.version
405.9 G /hbasedata/oldWALs
0 /hbasedata/xxxx
0 /hbasedata/xxxx
0 /hbasedata/xxxxx{code}
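For completeness, the same usage report can be collected programmatically; the following is a minimal sketch (not taken from the cluster above) using the Hadoop FileSystem API, assuming the /hbasedata root shown in the output:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HBaseDfsUsage {
    public static void main(String[] args) throws Exception {
        // Uses the client's default configuration (fs.defaultFS etc. from core-site.xml).
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Equivalent of "hdfs dfs -du /hbasedata": one line per child with total bytes.
            for (FileStatus status : fs.listStatus(new Path("/hbasedata"))) {
                long bytes = fs.getContentSummary(status.getPath()).getLength();
                System.out.printf("%15d  %s%n", bytes, status.getPath());
            }
        }
    }
}
{code}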
*Zookeeper myQueuesZnodes:*
The replication queue entries have not been cleared from ZooKeeper for a long time:
{code:java}
[zk: 10.216.xxx.xxx:2191,10.216.xxx.xxx:2191,10.216.xxx.xxx:2191(CONNECTED) 3] get /hbase/replication/rs/10.216.xxx.xxx,16020,1554361253037/3/10.216.xxx.xxx%2C16020%2C1554361253037.1564
Display all 179 possibilities? (y or n)
10.216.xxx.xxx%2C16020%2C1554361253037.1554638617629
10.216.xxx.xxx%2C16020%2C1554361253037.1554485372383
....
{code}
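For reference, the number of queued WALs per peer can also be counted with a small client. Below is a minimal sketch using the plain ZooKeeper API, assuming the znode layout shown above (/hbase/replication/rs/<server>/<peerId>/<wal>); the quorum string and timeout are placeholders:
{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ReplicationQueueSizes {
    public static void main(String[] args) throws Exception {
        // Placeholder quorum string; use the real ensemble addresses and port.
        ZooKeeper zk = new ZooKeeper("10.216.xxx.xxx:2191", 30000, event -> { });
        try {
            String rsRoot = "/hbase/replication/rs";
            // Walk /hbase/replication/rs/<server>/<peerId> and count the queued WAL znodes.
            for (String server : zk.getChildren(rsRoot, false)) {
                for (String peer : zk.getChildren(rsRoot + "/" + server, false)) {
                    List<String> wals = zk.getChildren(rsRoot + "/" + server + "/" + peer, false);
                    System.out.println(server + " peer=" + peer + " queuedWALs=" + wals.size());
                }
            }
        } finally {
            zk.close();
        }
    }
}
{code}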
*Status 'replication' output of the regionserver:*
Note that for peer 3 the timestamp of the last shipped op is still the epoch, so the reported replication lag is effectively the current time in milliseconds:
{code:java}
SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Wed May 01 07:40:55 IST 2019, Replication Lag=8158202958
        PeerID=3, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Thu Jan 01 05:30:00 IST 1970, Replication Lag=1564834858373
SINK:   AgeOfLastAppliedOp=378, TimeStampsOfLastAppliedOp=Sat Aug 03 17:50:44 IST 2019
{code}
*Stacktrace of the replicationSource thread:*
We can see that the thread has been blocked in entryBatchQueue.take() for a long time:
{code:java}
"main-EventThread.replicationSource,3.replicationSource.10.216.xxx.xxx%2C16020%2C1554360804184,3"
#10121291 daemon prio=5 os_prio=0 tid=0x00007f00e14ea000 nid=0x6d4c0 waiting
on condition [0x00007ef7598c9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007ef8889a1b98> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.take(ReplicationSourceWALReaderThread.java:227)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:550)
{code}
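To make the state in the trace concrete, here is a self-contained sketch (not the actual HBase code) of the same pattern: a shipper thread parked in LinkedBlockingQueue.take() because the reader filters out every entry and never enqueues a batch:
{code:java}
import java.util.concurrent.LinkedBlockingQueue;

public class StuckShipperSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the entryBatchQueue shared by the WAL reader and the shipper.
        LinkedBlockingQueue<String> entryBatchQueue = new LinkedBlockingQueue<>(1);

        // Stand-in for the reader: every entry it reads is discarded by the filter
        // (as ClusterMarkingEntryFilter does for edits that came from the peer),
        // so nothing is ever offered to the queue.
        Thread reader = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(100); // read entry, filter returns null, enqueue nothing
                }
            } catch (InterruptedException ignored) {
                // exit on interrupt
            }
        }, "reader");
        reader.setDaemon(true);
        reader.start();

        // Stand-in for the shipper: like ReplicationSourceShipperThread, it parks here
        // in take() (thread state WAITING), so the code that would advance the WAL
        // position is never reached and the queued WAL is never released.
        String batch = entryBatchQueue.take();
        System.out.println("never reached: " + batch);
    }
}
{code}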
Kindly let us know if you need any specific logs for your analysis.
> OldWALs not cleared in a replication slave cluster (cyclic replication bw 2
> clusters)
> -------------------------------------------------------------------------------------
>
> Key: HBASE-22784
> URL: https://issues.apache.org/jira/browse/HBASE-22784
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Replication
> Affects Versions: 1.4.9, 1.4.10
> Reporter: Solvannan R M
> Priority: Major
>
> When a cluster is passive (receiving edits only via replication) in a cyclic
> replication setup of 2 clusters, the OldWALs size keeps on growing. On analysing
> this, we observed the following behaviour (sketched in code after this description):
> # A new entry is added to the WAL (an edit replicated from the other cluster).
> # ReplicationSourceWALReaderThread (RSWALRT) reads the entry and applies the
> configured filters (due to the cyclic replication setup, ClusterMarkingEntryFilter
> discards the entry coming from the other cluster).
> # The filtered entry is null, so RSWALRT neither updates the batch stats
> (WALEntryBatch.lastWalPosition) nor puts the batch in the entryBatchQueue.
> # The ReplicationSource thread stays blocked in entryBatchQueue.take().
> # So ReplicationSource#updateLogPosition is never invoked and the WAL file is
> never cleared from the ReplicationQueue.
> # Hence the LogCleaner on the master doesn't delete the oldWAL files from HDFS.
> NOTE: When a new edit is added via hbase-client, the ReplicationSource thread
> processes it, clears the oldWAL files from the replication queues, and hence the
> master cleans up the WALs.
> Please provide us with a solution.
>
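For illustration only, here is a simplified pseudo-implementation of the reader loop described in steps 1-5 above. The names WALEntryBatch, entryBatchQueue and lastWalPosition follow the description; the stand-in types and helper methods are hypothetical, not the actual HBase source:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ReaderLoopSketch {
    // Minimal stand-ins for the types named in the description above.
    static class Entry {
        final boolean replicatedFromPeer;
        Entry(boolean replicatedFromPeer) { this.replicatedFromPeer = replicatedFromPeer; }
    }
    static class WALEntryBatch {
        final List<Entry> entries = new ArrayList<>();
        long lastWalPosition; // only advanced when a batch is actually built
    }

    static final BlockingQueue<WALEntryBatch> entryBatchQueue = new LinkedBlockingQueue<>();

    // Stand-in for ClusterMarkingEntryFilter: drops edits that originated on the peer.
    static Entry filter(Entry e) {
        return e.replicatedFromPeer ? null : e;
    }

    // One iteration of the reader loop as described: if every entry is filtered out,
    // no batch is enqueued, so the shipper stays parked in take() and never calls
    // updateLogPosition for this WAL.
    static void readerIteration(Entry rawEntry, long walPositionAfterRead) throws InterruptedException {
        Entry filtered = filter(rawEntry);
        if (filtered != null) {
            WALEntryBatch batch = new WALEntryBatch();
            batch.entries.add(filtered);
            batch.lastWalPosition = walPositionAfterRead;
            entryBatchQueue.put(batch); // shipper wakes up, ships, advances the log position
        }
        // else: the position is not recorded anywhere, the WAL stays in the replication
        // queue znode, and the master's LogCleaner keeps the file in oldWALs.
    }

    public static void main(String[] args) throws Exception {
        readerIteration(new Entry(true), 42L); // a replicated edit: dropped, nothing queued
        System.out.println("batches queued: " + entryBatchQueue.size()); // prints 0
    }
}
{code}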