[ 
https://issues.apache.org/jira/browse/HBASE-29501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-29501.
-------------------------------
    Fix Version/s: 2.7.0
                   3.0.0-beta-2
                   2.6.5
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to branch-2.6+.

Thanks [~tomasb] for contributing!

> IOException in SerialReplicationChecker.canPush causes entries to be pushed 
> out of order
> ----------------------------------------------------------------------------------------
>
>                 Key: HBASE-29501
>                 URL: https://issues.apache.org/jira/browse/HBASE-29501
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.6.2
>            Reporter: Tomas
>            Assignee: Tomas
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.5
>
>
> HBase version: 2.6.2-hadoop3, 
> revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17 
> h1. Problem
> In several HBase test clusters with serial replication enabled, we observed 
> entries with a higher sequence ID being pushed before entries with a lower 
> sequence ID when _SerialReplicationChecker.canPush_ throws an 
> {_}IOException{_}.
> The exception is caught in 
> {_}SerialReplicationSourceWALReader.readWALEntries{_}. If the batch is still 
> empty, the catch block sleeps and then falls through instead of breaking out 
> of the surrounding for loop, so the code may continue to push the entry and 
> record its sequence ID in ZooKeeper:
>  
> {code:java}
> try {
>  if (!checker.canPush(entry, firstCellInEntryBeforeFiltering)) {
>    if (batch.getLastWalPosition() > positionBefore) {
>      // we have something that can push, break
>      break;
>    } else {
>      checker.waitUntilCanPush(entry, firstCellInEntryBeforeFiltering);
>    }
>  }
> } catch (IOException e) {
>  LOG.warn("failed to check whether we can push the WAL entries", e);
>  if (batch.getLastWalPosition() > positionBefore) {
>    // we have something that can push, break
>    break;
>  }
>  sleepMultiplier = sleep(sleepMultiplier);
> }
> // <--- continue here after exception is caught
> // arrive here means we can push the entry, record the last sequence id
> batch.setLastSeqId(Bytes.toString(entry.getKey().getEncodedRegionName()),
>  entry.getKey().getSequenceId());
> // actually remove the entry.
> removeEntryFromStream(entryStream, batch);
> if (addEntryToBatch(batch, entry)) {
>  break;
> }
> {code}
>  
> h2. IOException Example 1) 
> The regionserver is terminating, which causes `{_}java.io.IOException: 
> connection is closed{_}` when the meta table is scanned for barriers. The RS 
> shutdown may race with the shipper finishing replication of the entry:
>  
> {code:java}
> 2025-08-01T18:15:46,477 WARN  
> [regionserver/home-host-1:16020.replicationSource.wal-reader.home-host-1%2C16020%2C1754068134363,peer_2]
>  regionserver.SerialReplicationSourceWALReader: failed to check whether we 
> can push the WAL entries
> java.io.IOException: connection is closed
>     at 
> org.apache.hadoop.hbase.MetaTableAccessor.getMetaHTable(MetaTableAccessor.java:236)
>  ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2041)
>  ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.waitUntilCanPush(SerialReplicationChecker.java:268)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:89)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> {code}
>  
> h2. IOException Example 2)
> A timeout occurs while reading barriers from the hbase:meta table:
> {code:java}
> 2025-08-06T11:42:10,495 WARN  
> [regionserver/home-host-1:16020.replicationSource,peer_1.replicationSource.wal-reader.home-host-1%2C16020%2C1754475014225,peer_1]
>  regionserver.SerialReplicationSourceWALReader: failed to check whether we 
> can push the WAL entries
> java.io.IOException: Failed to get result within timeout, timeout=60000ms
>     at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:250)
>  ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:53)
>  ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:206)
>  ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:281) 
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:450)
>  ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:324)
>  ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:622) 
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2043)
>  ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:262)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:84)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
>     at 
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35)
>  ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> {code}
> h2. Other IOExceptions
> Reading _pushedSeqId_ from ZooKeeper can also throw an 
> IOException.
> h1. Impact
> This bug breaks serial replication guarantees (entries must be pushed in 
> order based on their seqId).
>  
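
The fall-through described in the quoted report can be reproduced in isolation. The sketch below is not HBase code and is not the patch that was merged for this issue: PushChecker, FlakyChecker, readBatchBuggy and readBatchChecked are hypothetical names, entries are reduced to bare sequence IDs, and the sleep/backoff is omitted. It only illustrates one way to make sure an entry is never added to the batch unless its check actually returned true.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerialPushSketch {

  /** Hypothetical stand-in for SerialReplicationChecker.canPush. */
  interface PushChecker {
    boolean canPush(long seqId) throws IOException;
  }

  /** Throws on the first `failures` calls, then allows seqIds up to a limit. */
  static final class FlakyChecker implements PushChecker {
    private int failures;
    private final long maxPushableSeqId;

    FlakyChecker(int failures, long maxPushableSeqId) {
      this.failures = failures;
      this.maxPushableSeqId = maxPushableSeqId;
    }

    @Override
    public boolean canPush(long seqId) throws IOException {
      if (failures > 0) {
        failures--;
        throw new IOException("connection is closed");
      }
      return seqId <= maxPushableSeqId;
    }
  }

  /** Mirrors the reported flow: after the catch block, execution falls through. */
  static List<Long> readBatchBuggy(List<Long> entries, PushChecker checker) {
    List<Long> batch = new ArrayList<>();
    for (long seqId : entries) {
      try {
        if (!checker.canPush(seqId)) {
          break; // simplified: the real reader either breaks or waits here
        }
      } catch (IOException e) {
        if (!batch.isEmpty()) {
          break; // something is already batched, ship that first
        }
        // the real reader sleeps here, then continues below
      }
      batch.add(seqId); // reached even though the check never succeeded
    }
    return batch;
  }

  /** One possible correction (an assumption): retry the check, never skip it. */
  static List<Long> readBatchChecked(List<Long> entries, PushChecker checker,
      int maxRetries) {
    List<Long> batch = new ArrayList<>();
    outer:
    for (long seqId : entries) {
      int attempts = 0;
      while (true) {
        try {
          if (checker.canPush(seqId)) {
            break; // verified, leave the retry loop and add the entry
          }
          break outer; // not pushable yet, stop reading
        } catch (IOException e) {
          if (!batch.isEmpty() || ++attempts > maxRetries) {
            break outer; // ship what we have, or give up for this round
          }
          // a real reader would back off here before retrying the same entry
        }
      }
      batch.add(seqId);
    }
    return batch;
  }

  public static void main(String[] args) {
    List<Long> entries = Arrays.asList(10L, 11L);
    // No entry is pushable (limit 5) and the first check throws.
    System.out.println(readBatchBuggy(entries, new FlakyChecker(1, 5L)));
    System.out.println(readBatchChecked(entries, new FlakyChecker(1, 5L), 3));
  }
}
```

With these inputs the first call returns a batch containing seqId 10 although no check ever approved it, while the second returns an empty batch. When the entries are pushable (for example a limit of 11), the checked variant retries past the exception and returns both entries in order.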



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
