Tomas created HBASE-29499:
-----------------------------

             Summary: Serial replication stuck pushing entry with seqId equal 
to barrier
                 Key: HBASE-29499
                 URL: https://issues.apache.org/jira/browse/HBASE-29499
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 2.6.2
            Reporter: Tomas


HBase version: 2.6.2-hadoop3, revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17 
h1. Problem

On several test HBase clusters with serial replication enabled, where 
regionservers frequently crash or shut down non-gracefully, we found that the 
WAL can contain entries whose seqId equals a barrier in the meta table, e.g. 
barriers for region X = [2, 5, 6], entry for region X with seqId = 6 (equal to 
the barrier with value 6), and pushedSeqId=4 (seqId-2).

 

When {_}SerialReplicationChecker{_} checks whether those entries can be pushed, 
_canPush_ returns false, causing replication to block indefinitely.

 

Example 1:

{{2025-07-22T16:12:06,070 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
 regionserver.SerialReplicationChecker: Replication barrier for 
test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/{*}39{*}=[#edits: 0 = <>]: 
ReplicationBarrierResult [{*}barriers=[9, 17, 25, 28, 31, 34, 38, 39{*}], 
state=OPEN, parentRegionNames=]}}

{{2025-07-22T16:12:06,072 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
 regionserver.SerialReplicationChecker: *Previous range for 
test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>] has not been 
finished yet, give up*}}

{{2025-07-22T16:12:06,072 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
 regionserver.SerialReplicationChecker: Can not push 
test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>], wait}}

 
 * barriers=[9, 17, 25, 28, 31, 34, 38, 39]
 * The entry is for HBASE::REGION_EVENT::REGION_OPEN with seqId=39 and is *not 
from the last* range (the replication queue is claimed).
 * pushedSeqId=37

 

The previous range is calculated as 39 instead of 38, so the check 
37 >= 39-1 is false.

 

See 
[https://docs.google.com/document/d/1iB2xopSoC2IRHR8wmbGX5cmaS0RKsdFJiKeJ7EyLzeg]
 for more supporting information (ZooKeeper state, WALs).

 

Example 2:

 

{{2025-08-05T07:43:53,198 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
 {}] regionserver.SerialReplicationChecker: Replication barrier for 
aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/{*}650974464{*}=[#edits: 0 = <>]: 
ReplicationBarrierResult [barriers=[649436971, {*}650974464{*}, 650990494, 
651037843, 651092522, 651096754, 651118516, 651147941, 651173589], state=OPEN, 
parentRegionNames=]}}

{{2025-08-05T07:43:53,199 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
 {}] regionserver.SerialReplicationChecker: *Previous range for 
aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>] has not 
been finished yet, give up*}}

{{2025-08-05T07:43:53,199 DEBUG 
[RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
 {}] regionserver.SerialReplicationChecker: Can not push 
aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>], wait}}

 
 * barriers=[649436971, 650974464, 650990494, …]
 * The entry has seqId=650974464 and is *not from the last* range (the 
replication queue is claimed).
 * pushedSeqId=650974462

 

The previous range is calculated as 650974464 instead of 649436971, so the 
check 650974462 >= 650974464-1 is false.
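
The arithmetic in both examples can be reproduced with a small standalone 
model. This is a sketch of the check as this report describes it, not the HBase 
source: the class BarrierCheckSketch and its method names are hypothetical, and 
it assumes that a seqId which exactly matches a barrier is looked up as if it 
belonged to the range starting at that barrier.

```java
import java.util.Arrays;

// Simplified, standalone model of the check described in this report.
// NOT the HBase source: class and method names are hypothetical.
class BarrierCheckSketch {

  // Barrier used as the end of the "previous range" for an entry.
  // Assumption: an exact hit on a barrier moves the index one past the hit,
  // so the hit barrier itself is returned.
  static long previousEndBarrier(long[] barriers, long seqId) {
    if (barriers.length == 0 || seqId < barriers[0]) {
      throw new IllegalArgumentException("seqId is before the first barrier");
    }
    int index = Arrays.binarySearch(barriers, seqId);
    if (index < 0) {
      index = -index - 1; // miss: convert to the insertion point
    } else {
      index++;            // hit: seqId equals a barrier
    }
    return barriers[index - 1];
  }

  // The comparison quoted in the report: pushedSeqId >= endBarrier - 1.
  static boolean isRangeFinished(long pushedSeqId, long endBarrier) {
    return pushedSeqId >= endBarrier - 1;
  }

  public static void main(String[] args) {
    // Example 1: entry seqId=39, pushedSeqId=37
    long[] b1 = {9, 17, 25, 28, 31, 34, 38, 39};
    long end1 = previousEndBarrier(b1, 39);           // 39, not 38
    System.out.println(end1 + " " + isRangeFinished(37, end1)); // 39 false
    System.out.println(isRangeFinished(37, 38));      // true with barrier 38

    // Example 2: entry seqId=650974464, pushedSeqId=650974462
    long[] b2 = {649436971, 650974464, 650990494, 651037843, 651092522,
                 651096754, 651118516, 651147941, 651173589};
    long end2 = previousEndBarrier(b2, 650974464);    // 650974464
    System.out.println(end2 + " " + isRangeFinished(650974462L, end2)); // false
    System.out.println(isRangeFinished(650974462L, 649436971L));        // true
  }
}
```

In this model the check only passes if the barrier before the hit (38 in 
Example 1, 649436971 in Example 2) is used, which matches the values the 
report expects.
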
h1. Impact

Replication is blocked indefinitely for regions that contain the problematic 
entry.

Entries with a higher seqId than the problematic entry cannot be replicated 
either, because the previous range(s) are never considered finished.

The _sizeoflogqueue_ metric grows indefinitely as data is written to the 
region(s) and WALs are rolled.
h1. Workarounds

N/A. 

Turn off serial mode and replicate non-serially, OR remove and re-add the peer 
to restart replication (this leaves a gap in the replicated data).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
