Tomas created HBASE-29501:
-----------------------------

             Summary: IOException in SerialReplicationChecker.canPush causes 
entries to be pushed out of order
                 Key: HBASE-29501
                 URL: https://issues.apache.org/jira/browse/HBASE-29501
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 2.6.2
            Reporter: Tomas


HBase version: 2.6.2-hadoop3, revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17 
h1. Problem

In several HBase test clusters with serial replication enabled, we observed 
entries with a higher sequence ID being pushed before entries with a lower 
sequence ID when _SerialReplicationChecker.canPush_ throws an {_}IOException{_}.

The exception is caught in 
{_}SerialReplicationSourceWALReader.readWALEntries{_}. The catch block breaks 
out of the surrounding for loop only when the batch already contains entries. 
Otherwise it sleeps and then falls through, so the code may push the entry and 
record its sequence ID in ZooKeeper even though the check never completed:

 
{code:java}
try {
 if (!checker.canPush(entry, firstCellInEntryBeforeFiltering)) {
   if (batch.getLastWalPosition() > positionBefore) {
     // we have something that can push, break
     break;
   } else {
     checker.waitUntilCanPush(entry, firstCellInEntryBeforeFiltering);
   }
 }
} catch (IOException e) {
 LOG.warn("failed to check whether we can push the WAL entries", e);
 if (batch.getLastWalPosition() > positionBefore) {
   // we have something that can push, break
   break;
 }
 sleepMultiplier = sleep(sleepMultiplier);
}
// <--- continue here after exception is caught
// arrive here means we can push the entry, record the last sequence id
batch.setLastSeqId(Bytes.toString(entry.getKey().getEncodedRegionName()),
 entry.getKey().getSequenceId());
// actually remove the entry.
removeEntryFromStream(entryStream, batch);
if (addEntryToBatch(batch, entry)) {
 break;
}
{code}
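
The control flow above can be reproduced outside HBase. The following is a minimal sketch, not the HBase code and not necessarily the fix the project will adopt: _PushChecker_, _readFallThrough_, _readWithRetry_ and _failOnceThenDeny_ are hypothetical names, entries are reduced to bare sequence IDs, and waiting and backoff are omitted. It contrasts the fall-through behaviour with one possible alternative that retries the check and never adds an entry whose check did not complete.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SerialReaderSketch {

  /** Hypothetical stand-in for SerialReplicationChecker; not the HBase API. */
  interface PushChecker {
    boolean canPush(long seqId) throws IOException;
  }

  /** Mirrors the reported flow: after a failed check on an empty batch, the entry is still added. */
  static List<Long> readFallThrough(List<Long> entries, PushChecker checker) {
    List<Long> batch = new ArrayList<>();
    for (long seqId : entries) {
      try {
        if (!checker.canPush(seqId)) {
          break; // simplified: the real reader may wait here instead
        }
      } catch (IOException e) {
        if (!batch.isEmpty()) {
          break; // something can already be shipped
        }
        // empty batch: execution falls through to the add below
      }
      batch.add(seqId);
    }
    return batch;
  }

  /** One possible alternative: retry the check, and stop if it never completes. */
  static List<Long> readWithRetry(List<Long> entries, PushChecker checker, int maxAttempts) {
    List<Long> batch = new ArrayList<>();
    for (long seqId : entries) {
      boolean allowed = false;
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          allowed = checker.canPush(seqId);
          break; // the check completed, stop retrying
        } catch (IOException e) {
          if (!batch.isEmpty()) {
            return batch; // ship what has already been verified
          }
          // real code would sleep with backoff before the next attempt
        }
      }
      if (!allowed) {
        return batch; // denied, or the check never completed
      }
      batch.add(seqId);
    }
    return batch;
  }

  /** Test checker: throws on the first call, then reports that nothing may be pushed yet. */
  static PushChecker failOnceThenDeny() {
    boolean[] failed = {false};
    return seqId -> {
      if (!failed[0]) {
        failed[0] = true;
        throw new IOException("connection is closed");
      }
      return false;
    };
  }

  public static void main(String[] args) {
    List<Long> entries = List.of(20L, 21L);
    System.out.println(readFallThrough(entries, failOnceThenDeny())); // prints [20]
    System.out.println(readWithRetry(entries, failOnceThenDeny(), 3)); // prints []
  }
}
```

In the sketch, the fall-through variant adds entry 20 even though the checker would have denied it once reachable, while the retry variant returns an empty batch.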
 
h2. IOException example 1

The RegionServer is terminating, which causes `{_}java.io.IOException: connection is 
closed{_}` when scanning the meta table for barriers. RS shutdown may race with 
the shipper finishing replication of the entry:

 
{code:java}
2025-08-01T18:15:46,477 WARN  [regionserver/home-host-1:16020.replicationSource.wal-reader.home-host-1%2C16020%2C1754068134363,peer_2] regionserver.SerialReplicationSourceWALReader: failed to check whether we can push the WAL entries
java.io.IOException: connection is closed
    at org.apache.hadoop.hbase.MetaTableAccessor.getMetaHTable(MetaTableAccessor.java:236) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2041) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.waitUntilCanPush(SerialReplicationChecker.java:268) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:89) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
{code}
 
h2. IOException example 2

A timeout while reading barriers from the hbase:meta table:
{code:java}
2025-08-06T11:42:10,495 WARN  [regionserver/home-host-1:16020.replicationSource,peer_1.replicationSource.wal-reader.home-host-1%2C16020%2C1754475014225,peer_1] regionserver.SerialReplicationSourceWALReader: failed to check whether we can push the WAL entries
java.io.IOException: Failed to get result within timeout, timeout=60000ms
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:250) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:53) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:206) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:281) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:450) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:324) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:622) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2043) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:262) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:84) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
{code}
h2. Other IOExceptions

Reading _pushedSeqId_ from ZooKeeper may also throw an 
IOException.
h1. Impact

This bug breaks the serial replication guarantee that entries are pushed in 
order of their sequence ID (seqId).
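
A violation of this ordering can be detected from a record of what was pushed. The following is a minimal sketch under simplifying assumptions: _PushOrderCheck_ and _firstOutOfOrder_ are hypothetical names, the push history is given as parallel arrays of region name and sequence ID, and only ordering within a single region is checked (barriers across region moves, splits and merges are not modelled). A repeated sequence ID is not flagged, since an entry may be shipped again after a retry.

```java
import java.util.HashMap;
import java.util.Map;

class PushOrderCheck {

  /**
   * Returns the index of the first push whose sequence ID is lower than an
   * earlier push for the same region, or -1 if no such push exists.
   */
  static int firstOutOfOrder(String[] regions, long[] seqIds) {
    Map<String, Long> lastSeen = new HashMap<>();
    for (int i = 0; i < regions.length; i++) {
      Long prev = lastSeen.get(regions[i]);
      if (prev != null && seqIds[i] < prev) {
        return i; // a lower seqId arrived after a higher one
      }
      lastSeen.put(regions[i], seqIds[i]);
    }
    return -1;
  }

  public static void main(String[] args) {
    String[] regions = {"regionA", "regionB", "regionA"};
    long[] seqIds = {20L, 7L, 12L};
    System.out.println(firstOutOfOrder(regions, seqIds)); // prints 2
  }
}
```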

 


