[ 
https://issues.apache.org/jira/browse/HBASE-28155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775796#comment-17775796
 ] 

Duo Zhang commented on HBASE-28155:
-----------------------------------

I think this is a common problem for all branches.

When a shipper for a RecoveredReplicationSource finishes, we try to clean up 
everything. In ReplicationSource.initialize we create and start the shippers 
one by one, so it is possible that the first shipper has already started and 
finished while the second shipper has not yet been added to the workers map. 
In that case the logic in the tryFinish method sees no unfinished shippers and 
cleans up everything, even though there are still unfinished groups.
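
To make the interleaving concrete, here is a minimal single-threaded sketch of 
the worst case. It is not HBase code: EarlyFinishSketch, runToCompletion and 
the events list are hypothetical names, and it assumes that a finished shipper 
removes itself from the workers collection before the emptiness check runs.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the race; none of these names come from HBase.
class EarlyFinishSketch {
  // Stand-in for the workers map: the groups whose shipper is still registered.
  private final Set<String> workers = new LinkedHashSet<>();
  private final List<String> events = new ArrayList<>();

  // Mirrors the shape of initialize: register and start shippers one by one.
  List<String> initialize(List<String> walGroups) {
    for (String group : walGroups) {
      workers.add(group);
      events.add("started " + group);
      // Worst-case interleaving: this shipper runs to completion before the
      // loop has registered the next group.
      runToCompletion(group);
    }
    return events;
  }

  // Assumption: a finished shipper deregisters itself, then calls tryFinish.
  private void runToCompletion(String group) {
    workers.remove(group);
    tryFinish(group);
  }

  // Same emptiness check as the master/branch-3 snippet, without the cleanup.
  private void tryFinish(String group) {
    if (workers.isEmpty()) {
      events.add("source finished after " + group);
    }
  }

  public static void main(String[] args) {
    new EarlyFinishSketch().initialize(List.of("g1", "g2")).forEach(System.out::println);
  }
}
```

With two groups, the sketch records the source as finished right after the 
first group, before the second shipper has even been registered.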

On branch-2.x there is a sleep in the tryFinish method, which makes this much 
less likely, but in theory it can still happen. On master and branch-3 there 
is no sleep, so it is much more likely than on branch-2.x.

The code for master and branch-3

{code}
        if (workerThreads.isEmpty()) {
          this.getSourceMetrics().clear();
          manager.finishRecoveredSource(this);
        }
{code}

For branch-2.x

{code}
    synchronized (workerThreads) {
      // wait a short while for other worker thread to fully exit
      Threads.sleep(100);
      boolean allTasksDone =
        workerThreads.values().stream().allMatch(w -> w.isFinished());
      if (allTasksDone) {
        this.getSourceMetrics().clear();
        manager.removeRecoveredSource(this);
        LOG.info("Finished recovering queue {} with the following stats: {}",
          queueId, getStats());
      }
    }
{code}



> RecoveredReplicationSource quit when there are still unfinished groups
> ----------------------------------------------------------------------
>
>                 Key: HBASE-28155
>                 URL: https://issues.apache.org/jira/browse/HBASE-28155
>             Project: HBase
>          Issue Type: Bug
>          Components: Recovery, Replication
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Critical
>             Fix For: 2.6.0, 3.0.0-beta-1
>
>         Attachments: 
> org.apache.hadoop.hbase.replication.TestSyncReplicationStandbyKillRS-output.txt
>
>
> Need to dig more, but it seems to be related to how we deal with 
> RecoveredReplicationSource and the queue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
