Tomas created HBASE-29501:
-----------------------------

Summary: IOException in SerialReplicationChecker.canPush causes entries to be pushed out of order
Key: HBASE-29501
URL: https://issues.apache.org/jira/browse/HBASE-29501
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.6.2
Reporter: Tomas

HBase version: 2.6.2-hadoop3, revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17

h1. Problem

In several HBase test clusters with serial replication enabled, we observed entries with a higher sequence ID being pushed before entries with a lower sequence ID when _SerialReplicationChecker.canPush_ throws an _IOException_.

The exception is caught in _SerialReplicationSourceWALReader.readWALEntries_. When the batch is still empty, the handler does not break out of the surrounding for loop. It sleeps and then falls through, so the code pushes the entry and records its sequence ID in ZooKeeper without a successful check:

{code:java}
try {
  if (!checker.canPush(entry, firstCellInEntryBeforeFiltering)) {
    if (batch.getLastWalPosition() > positionBefore) {
      // we have something that can push, break
      break;
    } else {
      checker.waitUntilCanPush(entry, firstCellInEntryBeforeFiltering);
    }
  }
} catch (IOException e) {
  LOG.warn("failed to check whether we can push the WAL entries", e);
  if (batch.getLastWalPosition() > positionBefore) {
    // we have something that can push, break
    break;
  }
  sleepMultiplier = sleep(sleepMultiplier);
}
// <--- continue here after exception is caught
// arrive here means we can push the entry, record the last sequence id
batch.setLastSeqId(Bytes.toString(entry.getKey().getEncodedRegionName()),
  entry.getKey().getSequenceId());
// actually remove the entry.
removeEntryFromStream(entryStream, batch);
if (addEntryToBatch(batch, entry)) {
  break;
}
{code}

h2. IOException Example

1) The regionserver is terminating, which causes _java.io.IOException: connection is closed_ when scanning the meta table for barriers.
The RS shutdown may race with the shipper finishing replication of the entry:

{code:java}
2025-08-01T18:15:46,477 WARN [regionserver/home-host-1:16020.replicationSource.wal-reader.home-host-1%2C16020%2C1754068134363,peer_2] regionserver.SerialReplicationSourceWALReader: failed to check whether we can push the WAL entries
java.io.IOException: connection is closed
	at org.apache.hadoop.hbase.MetaTableAccessor.getMetaHTable(MetaTableAccessor.java:236) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2041) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.waitUntilCanPush(SerialReplicationChecker.java:268) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:89) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
{code}

h2. IOException Example

2) Timeout reading barriers from the hbase:meta table:

{code:java}
2025-08-06T11:42:10,495 WARN [regionserver/home-host-1:16020.replicationSource,peer_1.replicationSource.wal-reader.home-host-1%2C16020%2C1754475014225,peer_1] regionserver.SerialReplicationSourceWALReader: failed to check whether we can push the WAL entries
java.io.IOException: Failed to get result within timeout, timeout=60000ms
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:250) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:53) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:206) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:281) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:450) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:324) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:622) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2043) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:262) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:84) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
	at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
{code}

h2. Other IOExceptions

Reading _pushedSeqId_ from ZooKeeper may also throw an IOException.

h1. Impact

This bug breaks the serial replication guarantee that entries are pushed in order of their seqId.
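
h2. Control-flow sketch

The fall-through described under Problem can be reproduced outside HBase with a much smaller loop. The sketch below is not HBase code and not a proposed patch: _SerialPushSketch_, _Checker_, _readBatch_ and _failingOnceThenDenying_ are hypothetical names, and the checker is a stand-in that throws once and then reports that the entry must not be pushed. It only illustrates the difference between falling through after the exception and treating a failed check as "cannot push yet".

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SerialPushSketch {

  /** Hypothetical stand-in for the serial replication check. */
  interface Checker {
    boolean canPush(long seqId) throws IOException;
  }

  /** Throws on the first call (as in the stack traces), then always answers "not yet". */
  static Checker failingOnceThenDenying() {
    boolean[] failed = {false};
    return seqId -> {
      if (!failed[0]) {
        failed[0] = true;
        throw new IOException("connection is closed");
      }
      return false;
    };
  }

  /**
   * Simplified read loop. With fallThroughOnError=true a failed check is
   * handled like a successful one (the reported behaviour); with false the
   * entry is held back because the check never succeeded.
   */
  static List<Long> readBatch(List<Long> entries, Checker checker, boolean fallThroughOnError) {
    List<Long> batch = new ArrayList<>();
    for (long seqId : entries) {
      boolean allowed;
      try {
        allowed = checker.canPush(seqId);
      } catch (IOException e) {
        // The real reader logs and sleeps here; the sketch only picks the outcome.
        allowed = fallThroughOnError;
      }
      if (!allowed) {
        break; // stop before recording the sequence id or adding the entry
      }
      batch.add(seqId);
    }
    return batch;
  }

  public static void main(String[] args) {
    List<Long> entries = List.of(20L);
    // prints "fall-through on error: [20]"
    System.out.println("fall-through on error: "
        + readBatch(entries, failingOnceThenDenying(), true));
    // prints "treat error as not pushable: []"
    System.out.println("treat error as not pushable: "
        + readBatch(entries, failingOnceThenDenying(), false));
  }
}
```

In the first call entry 20 ends up in the batch even though no check ever succeeded, which mirrors the out-of-order push. An actual fix in _readWALEntries_ would presumably also need to retry the check and respect reader shutdown, which this sketch leaves out.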