[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900971#comment-16900971
 ] 

Solvannan R M commented on HBASE-22784:
---------------------------------------

Hi [~wchevreuil],

Thanks for the pointers!

1. *ReplicationWALReaderThread stack trace*
{code:java}
"main-EventThread.replicationSource,3.replicationSource.replicationWALReaderThread.10.216.xxx.xxx%2C16020%2C1554360804184,3" #10121292 daemon prio=5 os_prio=0 tid=0x00007f00e0f75000 nid=0x6d4c1 waiting on condition [0x00007ef765a8e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:162)
{code}
2. Even after we restarted the regionserver, the logs were not cleared.

3. *RegionServer logs:*

With TRACE logging enabled, the RSWALRT keeps printing the following message:
{code:java}
[regionserver//10.216.xxx.xxx:16020.replicationSource.replicationWALReaderThread.10.216.xxx.xxx%2C16020%2C1554361253037,1] regionserver.ReplicationSourceWALReaderThread: Didn't read any new entries from WAL
2019-08-03 17:48:56,722 TRACE [main-EventThread.replicationSource,3.replicationSource.replicationWALReaderThread.10.216.xxx.xxx%2C16020%2C1554361253037,3] regionserver.ReplicationSourceWALReaderThread: Didn't read any new entries from WAL
2019-08-03 17:48:57,725 TRACE [main-EventThread.replicationSource,3.replicationSource.replicationWALReaderThread.10.216.xxx.xxx%2C16020%2C1554361253037,3] regionserver.ReplicationSourceWALReaderThread: Didn't read any new entries from WAL
{code}
Analyzing the replication source and attaching a debugger to the 
RegionServer process led us to the observations mentioned in the 
description: the RSWALRT doesn't queue any entries, which leaves the 
ReplicationSourceShipperThread in a blocked state.
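
To illustrate that behaviour, here is a minimal, self-contained sketch. It is not HBase code: the class FilteredWalSketch, the Entry type and the methods filter, readerPass and shipperStep are hypothetical stand-ins for the WAL entries, ClusterMarkingEntryFilter, the RSWALRT loop and the shipper, and the queue only mimics the entryBatchQueue. The shipper step uses a non-blocking poll() so the example terminates, whereas the thread we observed blocks in take().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FilteredWalSketch {
    /** Hypothetical stand-in for a WAL entry: origin cluster plus end offset in the WAL. */
    static final class Entry {
        final String originClusterId;
        final long endPosition;
        Entry(String originClusterId, long endPosition) {
            this.originClusterId = originClusterId;
            this.endPosition = endPosition;
        }
    }

    /** Stand-in for the cluster-marking filter: drops edits that came from the peer. */
    static Entry filter(Entry e, String peerClusterId) {
        return peerClusterId.equals(e.originClusterId) ? null : e;
    }

    /** One reader pass: a batch is queued only if at least one entry survives the filter. */
    static void readerPass(List<Entry> wal, String peerClusterId,
                           BlockingQueue<List<Entry>> queue) {
        List<Entry> batch = new ArrayList<>();
        for (Entry e : wal) {
            Entry kept = filter(e, peerClusterId);
            if (kept != null) {
                batch.add(kept);
            }
        }
        if (!batch.isEmpty()) {
            queue.add(batch); // unbounded queue, so add() never fails here
        }
        // Nothing is queued for a fully filtered WAL, so no position is reported.
    }

    /** One shipper step: returns the shipped position, or -1 if no batch was queued. */
    static long shipperStep(BlockingQueue<List<Entry>> queue) {
        List<Entry> batch = queue.poll(); // the observed thread blocks in take() instead
        if (batch == null) {
            return -1L;
        }
        return batch.get(batch.size() - 1).endPosition;
    }

    public static void main(String[] args) {
        BlockingQueue<List<Entry>> queue = new LinkedBlockingQueue<>();

        // Passive cluster: every edit in the WAL arrived via replication from the peer.
        readerPass(Arrays.asList(new Entry("peer", 100L), new Entry("peer", 200L)),
                   "peer", queue);
        System.out.println("passive: shipped position = " + shipperStep(queue));

        // A single local edit is enough for a batch to reach the shipper.
        readerPass(Arrays.asList(new Entry("peer", 300L), new Entry("local", 400L)),
                   "peer", queue);
        System.out.println("after local edit: shipped position = " + shipperStep(queue));
    }
}
```

In the first pass nothing reaches the queue, so the shipper never learns a new position. This matches what we see on the passive side, where the WAL stays in the replication queue until a client edit arrives.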

We also came across HBASE-22620, which seems to be relevant.

As for the cyclic replication setup: our use case is an Active-Active setup, 
and this problem occurs when there is no load on one side.

> OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 
> clusters)
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-22784
>                 URL: https://issues.apache.org/jira/browse/HBASE-22784
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, Replication
>    Affects Versions: 1.4.9, 1.4.10
>            Reporter: Solvannan R M
>            Assignee: Wellington Chevreuil
>            Priority: Major
>
> When a cluster is passive (receiving edits only via replication) in a cyclic 
> replication setup of 2 clusters, OldWALs size keeps on growing. On analysing, 
> we observed the following behaviour.
>  # New entry is added to WAL (Edit replicated from other cluster).
>  # ReplicationSourceWALReaderThread(RSWALRT) reads and applies the configured 
> filters (due to cyclic replication setup, ClusterMarkingEntryFilter discards 
> new entry from other cluster).
>  # Entry is null, RSWALRT neither updates the batch stats 
> (WALEntryBatch.lastWalPosition) nor puts it in the entryBatchQueue.
>  # ReplicationSource thread is blocked in entryBatchQueue.take().
>  # So ReplicationSource#updateLogPosition is never invoked and the WAL file is 
> never cleared from the ReplicationQueue.
>  # Hence LogCleaner on the master doesn't delete the oldWAL files from 
> hadoop.
> NOTE: When a new edit is added via hbase-client, the ReplicationSource thread 
> processes it and clears the oldWAL files from the replication queues, and hence 
> the master cleans up the WALs.
> Please provide us a solution.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
