[ 
https://issues.apache.org/jira/browse/HBASE-26768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenglei updated HBASE-26768:
-----------------------------
    Description: 
It seems that the problem HBASE-26449 described still exists, just in the following 
{{RegionReplicationSink.onComplete}}, which runs in Netty's nioEventLoop. First we 
add a replica to {{RegionReplicationSink.failedReplicas}} because of a replication 
failure at line 228 below. But before we reach line 238, the flushing thread calls 
{{RegionReplicationSink.add}}, which clears {{RegionReplicationSink.failedReplicas}} 
because of a flush-all edit. When the Netty nioEventLoop then continues to line 238, 
we still add the replica to failedReplicas, even though by now 
{{maxSequenceId < lastFlushedSequenceId}}.


{code:java}
207 private void onComplete(List<SinkEntry> sent,
208     Map<Integer, MutableObject<Throwable>> replica2Error) {
      ....
217   Set<Integer> failed = new HashSet<>();
218   for (Map.Entry<Integer, MutableObject<Throwable>> entry : replica2Error.entrySet()) {
219     Integer replicaId = entry.getKey();
220     Throwable error = entry.getValue().getValue();
221     if (error != null) {
222       if (maxSequenceId > lastFlushedSequenceId) {
            ...
228         failed.add(replicaId);
229       } else {
            ......
238   synchronized (entries) {
239     pendingSize -= toReleaseSize;
240     if (!failed.isEmpty()) {
241       failedReplicas.addAll(failed);
242       flushRequester.requestFlush(maxSequenceId);
243     }
        ......
253   }
254 }
{code}

The two code paths run in different threads, so it is possible that we have already 
cleared failedReplicas because of a flush-all edit, and then, in the replay callback, 
we add a replica to failedReplicas because of a replication failure, even though the 
failure actually happened before the flush-all edit.
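One way to avoid recording such a stale failure is to re-check {{maxSequenceId}} against {{lastFlushedSequenceId}} inside the synchronized block, at the point where failedReplicas is updated, rather than only at line 222. The following is a minimal, self-contained sketch of that idea; the class {{StaleFailureSketch}} and its methods are hypothetical illustrations, not HBase's actual code:

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the check-then-act race described above, with the
// re-check done under the same lock that guards failedReplicas.
public class StaleFailureSketch {

  private final Set<Integer> failedReplicas = new HashSet<>();
  private long lastFlushedSequenceId;

  // Called by the flushing thread on a flush-all edit: advances the flushed
  // sequence id and clears any recorded failures (as RegionReplicationSink.add
  // does in the description above).
  synchronized void onFlushAll(long flushedSeqId) {
    lastFlushedSequenceId = flushedSeqId;
    failedReplicas.clear();
  }

  // Called from the replication callback. Re-checking the sequence id here,
  // under the lock, drops a failure that a later flush-all already covered.
  synchronized void onComplete(int replicaId, long maxSequenceId) {
    if (maxSequenceId > lastFlushedSequenceId) {
      failedReplicas.add(replicaId);
    }
  }

  synchronized boolean isFailed(int replicaId) {
    return failedReplicas.contains(replicaId);
  }

  public static void main(String[] args) {
    StaleFailureSketch sink = new StaleFailureSketch();
    // Replication of edits up to seq 10 fails, but a flush-all up to seq 20
    // lands first; the re-check discards the stale failure.
    sink.onFlushAll(20);
    sink.onComplete(1, 10);
    System.out.println(sink.isFailed(1)); // prints "false"
  }
}
{code}

In this sketch the re-check and the set update happen atomically under one lock, so the interleaving described above cannot mark a replica failed after a flush-all has already superseded the failure.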


> Avoid unnecessary replication suspending in RegionReplicationSink
> -----------------------------------------------------------------
>
>                 Key: HBASE-26768
>                 URL: https://issues.apache.org/jira/browse/HBASE-26768
>             Project: HBase
>          Issue Type: Bug
>          Components: read replicas
>    Affects Versions: 3.0.0-alpha-2
>            Reporter: chenglei
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
