Tomas created HBASE-29499:
-----------------------------

Summary: Serial replication stuck pushing entry with seqId equal to barrier
Key: HBASE-29499
URL: https://issues.apache.org/jira/browse/HBASE-29499
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.6.2
Reporter: Tomas
HBase version: 2.6.2-hadoop3, revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17

h1. Problem

On several test HBase clusters with serial replication enabled, where region servers frequently crash or shut down non-gracefully, we found that a WAL can contain entries whose seqId equals a barrier in the meta table. For example: barriers for region X = [2, 5, 6], an entry for region X with seqId = 6 (equal to the barrier with value 6), and pushedSeqId = 4 (seqId - 2). When {_}SerialReplicationChecker{_} checks whether such an entry can be pushed, _canPush_ returns false, which blocks replication indefinitely.

Example 1:

{{2025-07-22T16:12:06,070 DEBUG [RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068] regionserver.SerialReplicationChecker: Replication barrier for test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/{*}39{*}=[#edits: 0 = <>]: ReplicationBarrierResult [{*}barriers=[9, 17, 25, 28, 31, 34, 38, 39{*}], state=OPEN, parentRegionNames=]}}
{{2025-07-22T16:12:06,072 DEBUG [RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068] regionserver.SerialReplicationChecker: *Previous range for test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>] has not been finished yet, give up*}}
{{2025-07-22T16:12:06,072 DEBUG [RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068] regionserver.SerialReplicationChecker: Can not push test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>], wait}}

* barriers=[9, 17, 25, 28, 31, 34, 38, 39]
* Entry is for HBASE::REGION_EVENT::REGION_OPEN with
seqid=39 from *not the last* range (replication queue is claimed).
* pushedSeqId=37

The previous range is calculated as 39 instead of 38, and 37 >= 39-1 is false.

See [https://docs.google.com/document/d/1iB2xopSoC2IRHR8wmbGX5cmaS0RKsdFJiKeJ7EyLzeg] for more supporting information (ZooKeeper state, WALs).

Example 2:

{{2025-08-05T07:43:53,198 DEBUG [RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843 {}] regionserver.SerialReplicationChecker: Replication barrier for aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/{*}650974464{*}=[#edits: 0 = <>]: ReplicationBarrierResult [barriers=[649436971, {*}650974464{*}, 650990494, 651037843, 651092522, 651096754, 651118516, 651147941, 651173589], state=OPEN, parentRegionNames=]}}
{{2025-08-05T07:43:53,199 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843 {}] regionserver.SerialReplicationChecker: *Previous range for aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>] has not been finished yet, give up*}} {{2025-08-05T07:43:53,199 DEBUG [RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843 {}] regionserver.SerialReplicationChecker: Can not push 
aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>], wait}}

* barriers=[649436971, 650974464, 650990494, …]
* Entry has seqid=650974464 and is from *not the last* range (replication queue is claimed).
* pushedSeqId=650974462

The previous range is calculated as 650974464 instead of 649436971, and 650974462 >= 650974464-1 is false.

h1. Impact

Replication is blocked indefinitely for regions that contain the problematic entry. Entries with a higher seqId than the problematic entry cannot be replicated either, because the previous range(s) are never considered finished. The _sizeoflogqueue_ metric grows indefinitely as data is written to the affected region(s) and WALs are rolled.

h1. Workarounds

No clean workaround is known. The options are to turn off serial mode and replicate non-serially, or to remove and re-add the peer to restart replication (this leaves a gap in the replicated data).
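
The arithmetic in both examples can be reproduced with a small standalone model. This is a sketch only, not the HBase source: the class and method names below are hypothetical, and it assumes that an exact match on a barrier is treated as the start of a new range (ranges left-closed, right-open) and that a range counts as finished when pushedSeqId >= endBarrier - 1, as quoted above. Parent-region handling and the first-range special case are ignored.

```java
import java.util.Arrays;

/**
 * Simplified model of the range check described in this report.
 * NOT the HBase implementation; all names here are hypothetical.
 */
class BarrierRangeModel {

    /**
     * Returns the barrier that the model uses as the end of the "previous range"
     * for an entry with the given seqId, or -1 if seqId is below the first barrier.
     * Assumption: an exact hit on a barrier puts the entry in the range that
     * STARTS at that barrier, so the hit barrier itself becomes the end barrier.
     */
    public static long previousRangeEndBarrier(long[] barriers, long seqId) {
        int index = Arrays.binarySearch(barriers, seqId);
        if (index == -1) {
            return -1; // seqId is before the first barrier
        }
        // Miss: convert to the insertion point. Hit: step past the matching barrier.
        index = index < 0 ? -index - 1 : index + 1;
        return barriers[index - 1];
    }

    /** The finished-range condition quoted in the report: pushedSeqId >= endBarrier - 1. */
    public static boolean isRangeFinished(long endBarrier, long pushedSeqId) {
        return pushedSeqId >= endBarrier - 1;
    }

    /** Combines the two steps above for one entry. */
    public static boolean previousRangeFinished(long[] barriers, long seqId, long pushedSeqId) {
        long endBarrier = previousRangeEndBarrier(barriers, seqId);
        return endBarrier == -1 || isRangeFinished(endBarrier, pushedSeqId);
    }

    public static void main(String[] args) {
        // Example 1 from the report: seqId equals the last barrier, pushedSeqId = seqId - 2.
        long[] barriers = {9, 17, 25, 28, 31, 34, 38, 39};
        System.out.println(previousRangeEndBarrier(barriers, 39));   // prints 39
        System.out.println(previousRangeFinished(barriers, 39, 37)); // prints false
    }
}
```

Under these assumptions the entry with seqId=39 is compared against barrier 39 rather than 38, so the check can only pass once pushedSeqId reaches 38, which matches the stuck state shown in the logs.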