[ 
https://issues.apache.org/jira/browse/HBASE-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024157#comment-16024157
 ] 

Jerry He commented on HBASE-18099:
----------------------------------

bq. This check acts as safety guard. Normally waitForFlushes() would get into 
loop and return non-zero value.
There is no really a need to have this safety guard. What is the concrete case 
which it guards against?
It is also a little strange to use a duration/time-lapse as indicator if an op 
is successful or not.  It can be zero but still valid.
For example, in the patch:
{code}
99          FlushResult res = region.flush(true);
100         if (res.getResult() == FlushResult.Result.CANNOT_FLUSH) {
101           long duration = region.waitForFlushes();
{code}
Between line 100 and line 101, the third-party flush can be over and 
writestate.flushing set to false.  Then the waitForFlushes is a no-op with 
duration zero. We don't want to fail the snapshot in this case.

We can just let region.waitForFlushes() be void and call it in line 101 without 
checking anything.
Please add a comment above line 100 to tell why we need to do line 101 -- there 
may be another ongoing flush for the same region we want to wait for.

> FlushSnapshotSubprocedure should check the return value from Region#flush()
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-18099
>                 URL: https://issues.apache.org/jira/browse/HBASE-18099
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>         Attachments: 18099.v1.txt, 18099.v2.txt, 18099.v3.txt
>
>
> In the following thread:
> http://search-hadoop.com/m/HBase/YGbbMXkeHlI9zo
> Jacob described the scenario where data from certain region were missing in 
> the snapshot.
> Here was related region server log:
> https://pastebin.com/1ECXjhRp
> He pointed out that concurrent flush from MemStoreFlusher.1 thread was not 
> initiated from the thread pool for snapshot.
> In RegionSnapshotTask#call() method there is this:
> {code}
>           region.flush(true);
> {code}
> The return value is not checked.
> In HRegion#flushcache(), Result.CANNOT_FLUSH may be returned due to:
> {code}
>           String msg = "Not flushing since "
>               + (writestate.flushing ? "already flushing"
>               : "writes not enabled");
> {code}
> This implies that FlushSnapshotSubprocedure may incorrectly skip waiting for 
> the concurrent flush to complete.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to