[
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902388#comment-16902388
]
Wellington Chevreuil commented on HBASE-22784:
----------------------------------------------
Yep, [~solvannan]'s analysis makes sense from what we can see in the logs/jstack.
It seems this was introduced by refactorings from HBASE-15995. As
[~anoop.hbase] mentioned, even if we find nothing to get replicated, we should
still advance the reading position in the wal. When we do
_[entryStream.hasNext|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L126],_
which ends up in
_[WALEntryStream.readNextAndSetPosition|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L280],_
we update the position on the _WALEntryStream_ instance only. We then rely on
_WALEntryStream.next_ _[returning a
WAL.Entry|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L130],_
and on that entry not being filtered, so that it can [set the stream position
into the WALEntryBatch instance to be
queued|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L137].
Then, as [~solvannan] originally pointed out, we only update the log position if
we get something back from the queue and call
[_shipEdits_|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L551],
which then finally [updates the log
position|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L638].
Prior to HBASE-15995, reading and shipping were done in the same thread. We can
see that it used to properly [set the log position even when no entries were
found for
replication|https://github.com/apache/hbase/commit/3cf4433260b60a0e0455090628cf60a9d5a180f3?diff=split#diff-3ac91d43acf51f23f0ffd8b0e5d2e649L711].
Let me check which branches are affected by this issue, then I will work on a
patch.
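The flow described above can be sketched as a minimal, self-contained model.
This is a hypothetical simplification, not the real HBase code: the class and
method names (_WalPositionSketch_, _Entry_, _filter_, _readAndShip_) are
illustrative stand-ins for _ReplicationSourceWALReaderThread_,
_WALEntryStream_ and _ClusterMarkingEntryFilter_, and it just models how the
stream position always advances while the shipped log position only moves when
a non-empty batch reaches the queue:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class WalPositionSketch {

    static class Entry {
        final boolean fromPeerCluster;
        Entry(boolean fromPeerCluster) { this.fromPeerCluster = fromPeerCluster; }
    }

    // Stand-in for ClusterMarkingEntryFilter in a cyclic setup: every edit
    // that originated on the peer cluster is filtered out (returns null).
    static Entry filter(Entry e) {
        return e.fromPeerCluster ? null : e;
    }

    // Returns the log position the shipper would record after reading `wal`.
    // The stream position advances on every read (readNextAndSetPosition),
    // but the shipped position only advances when a non-empty batch is
    // queued and shipEdits later calls updateLogPosition.
    static long readAndShip(Entry[] wal) {
        long streamPosition = 0;
        long shippedLogPosition = 0;
        Queue<Entry> entryBatchQueue = new ArrayDeque<>();

        for (Entry raw : wal) {
            streamPosition++;            // WALEntryStream advances regardless
            Entry e = filter(raw);
            if (e != null) {
                entryBatchQueue.add(e);  // only surviving entries are batched
            }
        }
        if (!entryBatchQueue.isEmpty()) {
            shippedLogPosition = streamPosition; // shipEdits -> updateLogPosition
        }
        // When every entry was filtered, shippedLogPosition stays at 0: the
        // replication queue keeps pointing at the old WAL, so the LogCleaner
        // on the master never deletes it.
        return shippedLogPosition;
    }

    public static void main(String[] args) {
        // Passive cluster: all three edits came from the peer, all filtered,
        // so the shipped position never moves.
        Entry[] peerOnly = { new Entry(true), new Entry(true), new Entry(true) };
        System.out.println("shipped position (peer-only WAL): " + readAndShip(peerOnly));

        // One local edit unblocks the shipper and the position finally moves.
        Entry[] withLocal = { new Entry(true), new Entry(false) };
        System.out.println("shipped position (with local edit): " + readAndShip(withLocal));
    }
}
```

This also matches the NOTE in the issue description: a single local edit (one
entry that survives the filter) is enough to make the position advance and let
the old WALs get cleaned.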
> OldWALs not cleared in a replication slave cluster (cyclic replication bw 2
> clusters)
> -------------------------------------------------------------------------------------
>
> Key: HBASE-22784
> URL: https://issues.apache.org/jira/browse/HBASE-22784
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Replication
> Affects Versions: 1.4.9, 1.4.10
> Reporter: Solvannan R M
> Assignee: Wellington Chevreuil
> Priority: Major
>
> When a cluster is passive (receiving edits only via replication) in a cyclic
> replication setup of 2 clusters, OldWALs size keeps on growing. On analysing,
> we observed the following behaviour.
> # New entry is added to WAL (Edit replicated from other cluster).
> # ReplicationSourceWALReaderThread(RSWALRT) reads and applies the configured
> filters (due to cyclic replication setup, ClusterMarkingEntryFilter discards
> new entry from other cluster).
> # The entry is null, so RSWALRT neither updates the batch stats
> (WALEntryBatch.lastWalPosition) nor puts the batch in the entryBatchQueue.
> # The ReplicationSource thread is blocked in entryBatchQueue.take().
> # So ReplicationSource#updateLogPosition is never invoked and the WAL file is
> never cleared from the ReplicationQueue.
> # Hence the LogCleaner on the master doesn't delete the oldWAL files from
> Hadoop.
> NOTE: When a new edit is added via hbase-client, the ReplicationSource thread
> processes it and clears the oldWAL files from the replication queues, and
> hence the master cleans up the WALs.
> Please provide us with a solution.
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)