[
https://issues.apache.org/jira/browse/HBASE-26768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504486#comment-17504486
]
Hudson commented on HBASE-26768:
--------------------------------
Results for branch master
[build #531 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/531/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(x) {color:red}-1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/531/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/531/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/531/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> Avoid unnecessary replication suspending in RegionReplicationSink
> -----------------------------------------------------------------
>
> Key: HBASE-26768
> URL: https://issues.apache.org/jira/browse/HBASE-26768
> Project: HBase
> Issue Type: Bug
> Components: read replicas
> Affects Versions: 3.0.0-alpha-2
> Reporter: chenglei
> Assignee: chenglei
> Priority: Major
> Fix For: 3.0.0-alpha-3
>
>
> It seems that the problem described in HBASE-26449 still exists in the
> following {{RegionReplicationSink.onComplete}}, which runs in Netty's
> nioEventLoop. Assume we have only one secondary replica. First we add the
> replica to {{failed}} because of a replication failure at line 228 below,
> but before we reach line 238, the flusher thread calls
> {{RegionReplicationSink.add}} and clears
> {{RegionReplicationSink.failedReplicas}} because of a flush-all edit. When
> the Netty nioEventLoop then reaches line 238, we still add the replica to
> {{failedReplicas}}, even though by now {{maxSequenceId < lastFlushedSequenceId}}.
> {code:java}
> 207   private void onComplete(List<SinkEntry> sent,
> 208       Map<Integer, MutableObject<Throwable>> replica2Error) {
> ....
> 217     Set<Integer> failed = new HashSet<>();
> 218     for (Map.Entry<Integer, MutableObject<Throwable>> entry : replica2Error.entrySet()) {
> 219       Integer replicaId = entry.getKey();
> 220       Throwable error = entry.getValue().getValue();
> 221       if (error != null) {
> 222         if (maxSequenceId > lastFlushedSequenceId) {
> ...
> 228           failed.add(replicaId);
> 229         } else {
> ......
> 238     synchronized (entries) {
> 239       pendingSize -= toReleaseSize;
> 240       if (!failed.isEmpty()) {
> 241         failedReplicas.addAll(failed);
> 242         flushRequester.requestFlush(maxSequenceId);
> 243       }
> ......
> 253     }
> 254   }
> {code}
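The interleaving above can be replayed deterministically in a single thread. The sketch below is not the actual HBase code; the class and field names are illustrative only, and the comments map each step back to the quoted line numbers.

```java
import java.util.HashSet;
import java.util.Set;

// Deterministic single-threaded replay of the race described above;
// names are illustrative, not the real HBase classes.
public class RaceReplay {

    // Returns the failedReplicas set produced by the buggy interleaving.
    static Set<Integer> replay() {
        Set<Integer> failedReplicas = new HashSet<>();
        long lastFlushedSequenceId = 0;
        long maxSequenceId = 5;     // seq id of the batch that failed
        Set<Integer> failed = new HashSet<>();

        // Netty nioEventLoop (around line 228): replication failed and
        // maxSequenceId (5) > lastFlushedSequenceId (0), so record it.
        if (maxSequenceId > lastFlushedSequenceId) {
            failed.add(1);          // the only secondary replica
        }

        // Flusher thread interleaves here: a flush-all edit arrives, so
        // RegionReplicationSink.add clears failedReplicas and advances
        // lastFlushedSequenceId past maxSequenceId.
        failedReplicas.clear();
        lastFlushedSequenceId = 10;

        // Netty nioEventLoop resumes (around line 238): the stale
        // `failed` set is applied without re-checking the sequence ids,
        // even though maxSequenceId < lastFlushedSequenceId by now.
        if (!failed.isEmpty()) {
            failedReplicas.addAll(failed);
        }
        return failedReplicas;      // wrongly contains replica 1
    }
}
```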
> What is worse, when we then invoke
> {{RegionReplicationFlushRequester.requestFlush}}, the flush may be skipped,
> because in the following {{RegionReplicationFlushRequester.flush}},
> {{pendingFlushRequestSequenceId}} is less than {{lastFlushedSequenceId}}. So
> the only secondary replica is marked failed while the requested flush is
> skipped, and replication may stay suspended until the next memstore flush:
> {code:java}
>   private synchronized void flush(Timeout timeout) {
>     pendingFlushRequest = null;
>     if (pendingFlushRequestSequenceId >= lastFlushedSequenceId) {
>       request();
>     }
>   }
> {code}
> I simulate this problem in the PR, and my fix is to double-check {{if
> (maxSequenceId > lastFlushedSequenceId)}} inside the {{synchronized}} block
> in {{RegionReplicationSink.onComplete}}.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)