[ 
https://issues.apache.org/jira/browse/HBASE-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770632#comment-17770632
 ] 

Duo Zhang commented on HBASE-28114:
-----------------------------------

Pasting my findings from the GitHub PR here:

{quote}
I've found a way to deal with the problem without checking whether the 
replication source is recovered or not. I noticed this because I could not 
easily reproduce the problem when writing a UT...

To simulate the problem on branch-2+, we first need to let the shipper thread 
finish calling walFileLengthProvider.getLogFileSizeIfBeingWritten and halt 
there, waiting for the log roller thread to start rolling and finish writing 
the trailer. Then, before the roller finishes postLogRoll (where we enqueue the 
new WAL file into logQueue), the shipper thread resumes, finishes the read, 
dequeues the current WAL file, gets a RETRY_IMMEDIATELY, and issues the next 
read. Only then can we hit the empty logQueue...

Let me think about how to simulate this in a UT...
{quote}
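
The interleaving described above can be pinned down with latches. The following is only a minimal sketch on a toy model: InterleavingSketch, shipperSeesEmptyQueue and the latch names are hypothetical, the queue is a plain ConcurrentLinkedQueue standing in for logQueue, and no HBase locking is modeled. It shows the latch technique for forcing the order of steps, not the actual replication code.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only: none of these names are HBase classes or methods.
final class InterleavingSketch {

  /** Forces the described interleaving; returns true if the shipper saw an empty queue. */
  static boolean shipperSeesEmptyQueue() {
    ConcurrentLinkedQueue<String> logQueue = new ConcurrentLinkedQueue<>();
    logQueue.add("wal-0");
    CountDownLatch lengthFetched = new CountDownLatch(1);
    CountDownLatch trailerWritten = new CountDownLatch(1);
    CountDownLatch shipperDone = new CountDownLatch(1);
    AtomicBoolean sawEmpty = new AtomicBoolean(false);

    Thread shipper = new Thread(() -> {
      try {
        // Step 1: the shipper has fetched the file length, then halts.
        lengthFetched.countDown();
        trailerWritten.await();
        // Step 3: finish the read, dequeue the current file, issue the next read.
        logQueue.poll();
        sawEmpty.set(logQueue.isEmpty());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        shipperDone.countDown();
      }
    });

    Thread roller = new Thread(() -> {
      try {
        lengthFetched.await();
        // Step 2: the roller has written the trailer of the old file.
        trailerWritten.countDown();
        // Pause before the stand-in for postLogRoll until the shipper is done.
        shipperDone.await();
        // Step 4: only now is the new file enqueued.
        logQueue.add("wal-1");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    shipper.start();
    roller.start();
    try {
      shipper.join();
      roller.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return sawEmpty.get();
  }

  public static void main(String[] args) {
    System.out.println("shipper saw empty queue: " + shipperSeesEmptyQueue());
  }
}
```

Because every step waits on a latch, the order is the same on every run, which is what a UT for this kind of race needs.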

{quote}
I tried to reproduce the problem by matching the above execution sequence, but 
then I found that the newly added code was not executed...

The problem here is that, when beingWritten == true, it is only safe to move 
to the next file if we get an EOF_WITH_TRAILER. We will schedule a close 
writer task, which will write the trailer, under the rollWriterLock, and the 
trailerPresent flag is set while opening the WAL reader.

This means that if we get beingWritten == true, the reader cannot have 
trailerPresent == true at the same time, because the WAL reader is guaranteed 
to have been opened before we write the trailer. So we can only get an 
EOF_AND_RESET. The next time we hit 
walFileLengthProvider.getLogFileSizeIfBeingWritten, it can only return 
beingWritten == false, because it needs to acquire the rollWriterLock, and 
once we hold the rollWriterLock the log roll has already finished. Thus 
postLogRoll has been called and the new WAL file has been enqueued to the 
logQueue...

Let me just add more comments in the code to describe the logic here.
{quote}
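
The lock argument above can be illustrated with a toy model. This is a hedged sketch, not HBase code: RollLockSketch, roll, lengthIfBeingWritten and invariantHolds are hypothetical names, and the model assumes that closing the old writer and enqueuing the new file (the stand-in for postLogRoll) both happen while rollWriterLock is held, and that the length query takes the same lock.

```java
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

// Toy model only: none of these names are HBase classes or methods.
final class RollLockSketch {
  private final ReentrantLock rollWriterLock = new ReentrantLock();
  private final ConcurrentLinkedQueue<String> logQueue = new ConcurrentLinkedQueue<>();
  private String currentWal = "wal-0"; // guarded by rollWriterLock
  private int seq = 0;                 // guarded by rollWriterLock

  RollLockSketch() {
    logQueue.add(currentWal);
  }

  /** Assumption: the old writer is closed and the new file enqueued under one lock. */
  void roll() {
    rollWriterLock.lock();
    try {
      currentWal = "wal-" + (++seq);
      logQueue.add(currentWal); // stand-in for postLogRoll
    } finally {
      rollWriterLock.unlock();
    }
  }

  /** Present only while the given file is the one still being written. */
  OptionalLong lengthIfBeingWritten(String wal) {
    rollWriterLock.lock();
    try {
      return wal.equals(currentWal) ? OptionalLong.of(0L) : OptionalLong.empty();
    } finally {
      rollWriterLock.unlock();
    }
  }

  /** Runs a roller thread against a shipper loop; false if the queue ever becomes empty. */
  static boolean invariantHolds(int rolls) {
    RollLockSketch wal = new RollLockSketch();
    Thread roller = new Thread(() -> {
      for (int i = 0; i < rolls; i++) {
        wal.roll();
      }
    });
    roller.start();
    boolean ok = true;
    while (roller.isAlive() || wal.logQueue.size() > 1) {
      String head = wal.logQueue.peek();
      if (wal.lengthIfBeingWritten(head).isPresent()) {
        continue; // still being written: keep reading the same file
      }
      // Not being written any more, so the roll finished and enqueued a successor.
      wal.logQueue.poll();
      if (wal.logQueue.isEmpty()) {
        ok = false;
        break;
      }
    }
    try {
      roller.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return ok;
  }

  public static void main(String[] args) {
    System.out.println("queue never empty: " + invariantHolds(1000));
  }
}
```

In this model the shipper only dequeues a file after the lock-protected query reports that it is no longer being written, at which point its successor is already in the queue, so the queue is never observed empty.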

> Add more comments to explain why replication log queue could never be empty 
> for normal replication queue
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28114
>                 URL: https://issues.apache.org/jira/browse/HBASE-28114
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 2.6.0, 3.0.0-beta-1
>
>
> In HBASE-28037, [~Xiaolin Ha] found that there could be a very small window 
> in which, even for a normal replication source, the queue could be empty.
> This is because we will only enqueue the wal file to the queue in 
> postLogRoll, where the old WAL writer has already been closed, so if the 
> replication is fast enough, we could reach the end of the queue before 
> enqueuing the new wal file.
> The code for branch-2+ has been refactored a lot so we opened a new issue for 
> fixing this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
