[
https://issues.apache.org/jira/browse/HBASE-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024157#comment-16024157
]
Jerry He commented on HBASE-18099:
----------------------------------
bq. This check acts as safety guard. Normally waitForFlushes() would get into
loop and return non-zero value.
There is no really a need to have this safety guard. What is the concrete case
which it guards against?
It is also a little strange to use a duration/time-lapse as indicator if an op
is successful or not. It can be zero but still valid.
For example, in the patch:
{code}
99 FlushResult res = region.flush(true);
100 if (res.getResult() == FlushResult.Result.CANNOT_FLUSH) {
101 long duration = region.waitForFlushes();
{code}
Between line 100 and line 101, the third-party flush can be over and
writestate.flushing set to false. Then the waitForFlushes is a no-op with
duration zero. We don't want to fail the snapshot in this case.
We can just let region.waitForFlushes() be void and call it in line 101 without
checking anything.
Please add a comment above line 100 to tell why we need to do line 101 -- there
may be another ongoing flush for the same region we want to wait for.
> FlushSnapshotSubprocedure should check the return value from Region#flush()
> ---------------------------------------------------------------------------
>
> Key: HBASE-18099
> URL: https://issues.apache.org/jira/browse/HBASE-18099
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Attachments: 18099.v1.txt, 18099.v2.txt, 18099.v3.txt
>
>
> In the following thread:
> http://search-hadoop.com/m/HBase/YGbbMXkeHlI9zo
> Jacob described the scenario where data from certain region were missing in
> the snapshot.
> Here was related region server log:
> https://pastebin.com/1ECXjhRp
> He pointed out that concurrent flush from MemStoreFlusher.1 thread was not
> initiated from the thread pool for snapshot.
> In RegionSnapshotTask#call() method there is this:
> {code}
> region.flush(true);
> {code}
> The return value is not checked.
> In HRegion#flushcache(), Result.CANNOT_FLUSH may be returned due to:
> {code}
> String msg = "Not flushing since "
> + (writestate.flushing ? "already flushing"
> : "writes not enabled");
> {code}
> This implies that FlushSnapshotSubprocedure may incorrectly skip waiting for
> the concurrent flush to complete.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)