[
https://issues.apache.org/jira/browse/HBASE-26768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
chenglei updated HBASE-26768:
-----------------------------
Description:
It seems that the problem HBASE-26449 described still exists, just in the following
{{RegionReplicationSink.onComplete}}, which runs in Netty's nioEventLoop. First we
add a replica to {{RegionReplicationSink.failedReplicas}} because of a replication
failure at line 228 below, but before we reach line 238, the flushing thread calls
{{RegionReplicationSink.add}}, which clears {{RegionReplicationSink.failedReplicas}}
because of a flush-all edit. When the Netty nioEventLoop then continues to line 238,
we still add the replica to failedReplicas even though by now
{{maxSequenceId < lastFlushedSequenceId}}.
{code:java}
207   private void onComplete(List<SinkEntry> sent,
208       Map<Integer, MutableObject<Throwable>> replica2Error) {
....
217     Set<Integer> failed = new HashSet<>();
218     for (Map.Entry<Integer, MutableObject<Throwable>> entry : replica2Error.entrySet()) {
219       Integer replicaId = entry.getKey();
220       Throwable error = entry.getValue().getValue();
221       if (error != null) {
222         if (maxSequenceId > lastFlushedSequenceId) {
...
228           failed.add(replicaId);
229         } else {
......
238     synchronized (entries) {
239       pendingSize -= toReleaseSize;
240       if (!failed.isEmpty()) {
241         failedReplicas.addAll(failed);
242         flushRequester.requestFlush(maxSequenceId);
243       }
......
253     }
254   }
{code}
These run in different threads, so it is possible that we have already cleared
the failedReplicas due to a flush-all edit, and then, in the replay callback, we
add a replica to the failedReplicas because of a replication failure, even though
the failure actually happened before the flush-all edit.
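One way to avoid the stale add is to re-check the sequence id against the current {{lastFlushedSequenceId}} inside the same lock that the flush path takes. The sketch below is a hypothetical, heavily simplified model of this idea, not the real {{RegionReplicationSink}}: the class name {{SimpleSink}}, the method names, and the single-lock layout are all illustrative assumptions.

{code:java}
// Hypothetical, simplified model of the race described above.
// Names and structure are illustrative; the real class has far more state.
import java.util.HashSet;
import java.util.Set;

public class SimpleSink {
  private final Set<Integer> failedReplicas = new HashSet<>();
  private long lastFlushedSequenceId = 0;

  // Called from the flushing thread on a flush-all edit:
  // advances the flushed sequence id and clears the failed set.
  public synchronized void onFlushAll(long flushedSeqId) {
    lastFlushedSequenceId = flushedSeqId;
    failedReplicas.clear();
  }

  // Called from the replication callback. maxSequenceId was captured
  // before the replication round started; a flush may have happened since.
  public synchronized void onComplete(int replicaId, long maxSequenceId, boolean error) {
    // Re-check against the *current* lastFlushedSequenceId inside the lock,
    // so a failure that predates a flush-all edit does not re-suspend the replica.
    if (error && maxSequenceId > lastFlushedSequenceId) {
      failedReplicas.add(replicaId);
    }
  }

  public synchronized boolean isFailed(int replicaId) {
    return failedReplicas.contains(replicaId);
  }
}
{code}
With this shape, a failure whose {{maxSequenceId}} is already below the flushed sequence id is simply dropped, because the check and the mutation of failedReplicas happen atomically with respect to the flush path.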
> Avoid unnecessary replication suspending in RegionReplicationSink
> -----------------------------------------------------------------
>
> Key: HBASE-26768
> URL: https://issues.apache.org/jira/browse/HBASE-26768
> Project: HBase
> Issue Type: Bug
> Components: read replicas
> Affects Versions: 3.0.0-alpha-2
> Reporter: chenglei
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)