[jira] [Commented] (HBASE-26768) Avoid unnecessary replication suspending in RegionReplicationSink

chenglei (Jira) Wed, 09 Mar 2022 19:03:06 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-26768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503965#comment-17503965
 ]


chenglei commented on HBASE-26768:
----------------------------------

Pushed to the master,  [~zhangduo], thank you very much for review.

> Avoid unnecessary replication suspending in RegionReplicationSink
> -----------------------------------------------------------------
>
>                 Key: HBASE-26768
>                 URL: https://issues.apache.org/jira/browse/HBASE-26768
>             Project: HBase
>          Issue Type: Bug
>          Components: read replicas
>    Affects Versions: 3.0.0-alpha-2
>            Reporter: chenglei
>            Assignee: chenglei
>            Priority: Major
>
> It seems that the problem HBASE-26449 described still exists in following 
> {{RegionReplicationSink.onComplete}},which is running in Netty's 
> nioEventLoop. Assuming we have only one secondary replica, first we add the 
> replica to the {{failed}} because a failure of replicating in following line 
> 228, but before we enter the line 238, the flusher thread calls 
> {{RegionReplicationSink.add}} and we clear the 
> {{RegionReplicationSink.failedReplicas}} due to a flush all edit. When the 
> Netty nioEventLoop continues to enter line 238, we still add a replica to the 
> failedReplicas even though the {{maxSequenceId < lastFlushedSequenceId}}.
> {code:java}
> 207 private void onComplete(List<SinkEntry> sent,
> 208         Map<Integer, MutableObject<Throwable>> replica2Error) {
>                       ....
> 217        Set<Integer> failed = new HashSet<>();
> 218        for (Map.Entry<Integer, MutableObject<Throwable>> entry : 
> replica2Error.entrySet()) {
> 219        Integer replicaId = entry.getKey();
> 220       Throwable error = entry.getValue().getValue();
> 221        if (error != null) {
> 222           if (maxSequenceId > lastFlushedSequenceId) {
>                      ...
> 228             failed.add(replicaId);
> 229           } else {
>                    ......
> 238       synchronized (entries) {
> 239           pendingSize -= toReleaseSize;
> 240           if (!failed.isEmpty()) {
> 241                failedReplicas.addAll(failed);
> 242                flushRequester.requestFlush(maxSequenceId);
> 243           }
>                 ......
> 253      }
> 254    }
> {code}
> What is worse, when we invoke 
> {{RegionReplicationFlushRequester.requestFlush}}, the flushing may be skipped 
> because in following {{RegionReplicationFlushRequester.flush}}, 
> {{pendingFlushRequestSequenceId}} is less than {{lastFlushedSequenceId}}, so 
> the only secondary replica is marked failed and requested flushing is 
> skipped, the replication may suspend until next memstore flush :
> {code:java}
> private synchronized void flush(Timeout timeout) {
>     pendingFlushRequest = null;
>     if (pendingFlushRequestSequenceId >= lastFlushedSequenceId) {
>       request();
>     }
>   }
> {code}
> I simulate this problem in the PR and my fix is double check the  {{if 
> (maxSequenceId > lastFlushedSequenceId)}} in the {{synchronized}} block in 
> {{RegionReplicationSink.onComplete}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (HBASE-26768) Avoid unnecessary replication suspending in RegionReplicationSink

Reply via email to