[
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902388#comment-16902388
]
Wellington Chevreuil commented on HBASE-22784:
----------------------------------------------
Yep, [~solvannan]'s analysis makes sense from what we can see in the logs/jstack.
It seems this was introduced by refactorings from HBASE-15995. As
[~anoop.hbase] mentioned, even if we find nothing to get replicated, we should
still advance the reading position in the wal. When we do
_[entryStream.hasNext|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L126],_
which ends up in
_[WALEntryStream.readNextAndSetPosition|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L280],_
we update the position on the _WALEntryStream_ instance only. We then rely on
_WALEntryStream.next_ _[returning a
WAL.Entry|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L130],_
and on that entry not being filtered, so that it can [set the stream position
into the WALEntryBatch instance to be
queued|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L137].
Then, as [~solvannan] originally pointed out, we only update the log position if
we get something back from the queue and call
[_shipEdits_|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L551],
which then finally [updates the log
position|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L638].
Prior to HBASE-15995, reading and shipping were done in the same thread. We can
see that it used to properly [set the log position even when no entries were
found for
replication|https://github.com/apache/hbase/commit/3cf4433260b60a0e0455090628cf60a9d5a180f3?diff=split#diff-3ac91d43acf51f23f0ffd8b0e5d2e649L711].
Let me check which branches are affected by this issue, then I will work on a
patch.
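The flow described above can be sketched as a minimal, self-contained model.
This is a hypothetical simplification, not the real HBase code: the class and
method names (_WalPositionSketch_, _Entry_, _filter_, _readAndShip_) are
illustrative stand-ins for _ReplicationSourceWALReaderThread_,
_WALEntryStream_ and _ClusterMarkingEntryFilter_, and it just models how the
stream position always advances while the shipped log position only moves when
a non-empty batch reaches the queue:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class WalPositionSketch {

    static class Entry {
        final boolean fromPeerCluster;
        Entry(boolean fromPeerCluster) { this.fromPeerCluster = fromPeerCluster; }
    }

    // Stand-in for ClusterMarkingEntryFilter in a cyclic setup: every edit
    // that originated on the peer cluster is filtered out (returns null).
    static Entry filter(Entry e) {
        return e.fromPeerCluster ? null : e;
    }

    // Returns the log position the shipper would record after reading `wal`.
    // The stream position advances on every read (readNextAndSetPosition),
    // but the shipped position only advances when a non-empty batch is
    // queued and shipEdits later calls updateLogPosition.
    static long readAndShip(Entry[] wal) {
        long streamPosition = 0;
        long shippedLogPosition = 0;
        Queue<Entry> entryBatchQueue = new ArrayDeque<>();

        for (Entry raw : wal) {
            streamPosition++;            // WALEntryStream advances regardless
            Entry e = filter(raw);
            if (e != null) {
                entryBatchQueue.add(e);  // only surviving entries are batched
            }
        }
        if (!entryBatchQueue.isEmpty()) {
            shippedLogPosition = streamPosition; // shipEdits -> updateLogPosition
        }
        // When every entry was filtered, shippedLogPosition stays at 0: the
        // replication queue keeps pointing at the old WAL, so the LogCleaner
        // on the master never deletes it.
        return shippedLogPosition;
    }

    public static void main(String[] args) {
        // Passive cluster: all three edits came from the peer, all filtered,
        // so the shipped position never moves.
        Entry[] peerOnly = { new Entry(true), new Entry(true), new Entry(true) };
        System.out.println("shipped position (peer-only WAL): " + readAndShip(peerOnly));

        // One local edit unblocks the shipper and the position finally moves.
        Entry[] withLocal = { new Entry(true), new Entry(false) };
        System.out.println("shipped position (with local edit): " + readAndShip(withLocal));
    }
}
```

This also matches the NOTE in the issue description: a single local edit (one
entry that survives the filter) is enough to make the position advance and let
the old WALs get cleaned.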
> OldWALs not cleared in a replication slave cluster (cyclic replication bw 2
> clusters)
> -------------------------------------------------------------------------------------
>
> Key: HBASE-22784
> URL: https://issues.apache.org/jira/browse/HBASE-22784
> Project: HBase
> Issue Type: Bug
> Components: regionserver, Replication
> Affects Versions: 1.4.9, 1.4.10
> Reporter: Solvannan R M
> Assignee: Wellington Chevreuil
> Priority: Major
>
> When a cluster is passive (receiving edits only via replication) in a cyclic
> replication setup of 2 clusters, OldWALs size keeps on growing. On analysing,
> we observed the following behaviour.
> # New entry is added to WAL (Edit replicated from other cluster).
> # ReplicationSourceWALReaderThread(RSWALRT) reads and applies the configured
> filters (due to cyclic replication setup, ClusterMarkingEntryFilter discards
> new entry from other cluster).
> # The entry is null, so RSWALRT neither updates the batch stats
> (WALEntryBatch.lastWalPosition) nor puts the batch in the entryBatchQueue.
> # The ReplicationSource thread is blocked in entryBatchQueue.take().
> # So ReplicationSource#updateLogPosition is never invoked and the WAL file is
> never cleared from the ReplicationQueue.
> # Hence the LogCleaner on the master doesn't delete the oldWAL files from
> Hadoop.
> NOTE: When a new edit is added via hbase-client, the ReplicationSource thread
> processes it and clears the oldWAL files from the replication queues, and
> hence the master cleans up the WALs.
> Please provide us with a solution.
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)