[
https://issues.apache.org/jira/browse/HBASE-26768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504486#comment-17504486
]
Hudson commented on HBASE-26768:
--------------------------------
Results for branch master
[build #531 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/531/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(x) {color:red}-1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/531/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/531/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/531/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> Avoid unnecessary replication suspending in RegionReplicationSink
> -----------------------------------------------------------------
>
> Key: HBASE-26768
> URL: https://issues.apache.org/jira/browse/HBASE-26768
> Project: HBase
> Issue Type: Bug
> Components: read replicas
> Affects Versions: 3.0.0-alpha-2
> Reporter: chenglei
> Assignee: chenglei
> Priority: Major
> Fix For: 3.0.0-alpha-3
>
>
> It seems that the problem described in HBASE-26449 still exists in the
> following {{RegionReplicationSink.onComplete}}, which runs in Netty's
> nioEventLoop. Assume we have only one secondary replica. First we add the
> replica to {{failed}} because of a replication failure at line 228 below,
> but before we reach line 238, the flusher thread calls
> {{RegionReplicationSink.add}} and clears
> {{RegionReplicationSink.failedReplicas}} because of a flush-all edit. When
> the Netty nioEventLoop then reaches line 238, we still add the replica to
> {{failedReplicas}}, even though by now {{maxSequenceId < lastFlushedSequenceId}}.
> {code:java}
> 207   private void onComplete(List<SinkEntry> sent,
> 208       Map<Integer, MutableObject<Throwable>> replica2Error) {
> ....
> 217     Set<Integer> failed = new HashSet<>();
> 218     for (Map.Entry<Integer, MutableObject<Throwable>> entry : replica2Error.entrySet()) {
> 219       Integer replicaId = entry.getKey();
> 220       Throwable error = entry.getValue().getValue();
> 221       if (error != null) {
> 222         if (maxSequenceId > lastFlushedSequenceId) {
> ...
> 228           failed.add(replicaId);
> 229         } else {
> ......
> 238     synchronized (entries) {
> 239       pendingSize -= toReleaseSize;
> 240       if (!failed.isEmpty()) {
> 241         failedReplicas.addAll(failed);
> 242         flushRequester.requestFlush(maxSequenceId);
> 243       }
> ......
> 253     }
> 254   }
> {code}
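The interleaving above can be replayed deterministically in a single thread. The sketch below is not the actual HBase code; the class and field names are illustrative only, and the comments map each step back to the quoted line numbers.

```java
import java.util.HashSet;
import java.util.Set;

// Deterministic single-threaded replay of the race described above;
// names are illustrative, not the real HBase classes.
public class RaceReplay {

    // Returns the failedReplicas set produced by the buggy interleaving.
    static Set<Integer> replay() {
        Set<Integer> failedReplicas = new HashSet<>();
        long lastFlushedSequenceId = 0;
        long maxSequenceId = 5;     // seq id of the batch that failed
        Set<Integer> failed = new HashSet<>();

        // Netty nioEventLoop (around line 228): replication failed and
        // maxSequenceId (5) > lastFlushedSequenceId (0), so record it.
        if (maxSequenceId > lastFlushedSequenceId) {
            failed.add(1);          // the only secondary replica
        }

        // Flusher thread interleaves here: a flush-all edit arrives, so
        // RegionReplicationSink.add clears failedReplicas and advances
        // lastFlushedSequenceId past maxSequenceId.
        failedReplicas.clear();
        lastFlushedSequenceId = 10;

        // Netty nioEventLoop resumes (around line 238): the stale
        // `failed` set is applied without re-checking the sequence ids,
        // even though maxSequenceId < lastFlushedSequenceId by now.
        if (!failed.isEmpty()) {
            failedReplicas.addAll(failed);
        }
        return failedReplicas;      // wrongly contains replica 1
    }
}
```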
> What is worse, when we then invoke
> {{RegionReplicationFlushRequester.requestFlush}}, the flush may be skipped,
> because in the following {{RegionReplicationFlushRequester.flush}},
> {{pendingFlushRequestSequenceId}} is less than {{lastFlushedSequenceId}}. So
> the only secondary replica is marked failed while the requested flush is
> skipped, and replication may stay suspended until the next memstore flush:
> {code:java}
>   private synchronized void flush(Timeout timeout) {
>     pendingFlushRequest = null;
>     if (pendingFlushRequestSequenceId >= lastFlushedSequenceId) {
>       request();
>     }
>   }
> {code}
> I simulate this problem in the PR, and my fix is to double-check {{if
> (maxSequenceId > lastFlushedSequenceId)}} inside the {{synchronized}} block
> in {{RegionReplicationSink.onComplete}}.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)