[jira] [Commented] (HBASE-20090) Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run

Ted Yu (JIRA) Sun, 04 Mar 2018 22:58:47 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385676#comment-16385676
 ]


Ted Yu commented on HBASE-20090:
--------------------------------

Thanks for taking a look, Anoop.

Did you read my explanation above the patch v01 QA result?

bq. by the time the global heap pressure flush try to do the flush, the size 
become zero

There were two regions on the region server of interest, region 
0453f29030757eedb6e6a1c57e88c085 was being split.
From the log I added, we can see that it appeared at 2018-03-02 17:28:30,298 
with non-zero size.
However, when the following loop kicked in:
{code}
    while (!flushedOne) {
{code}
that region started splitting. Therefore the other region, with memstore size 
0, was picked up.
The Preconditions check then failed because of the 0 memstore size.
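
As a rough illustration of that sequence, here is a minimal sketch. The class, 
field and method names are hypothetical stand-ins, not the actual HBase classes, 
and the check is hand-written rather than Guava's Preconditions. It assumes the 
selection step skips a region once it is splitting, which leaves only the empty 
region as a candidate.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: hypothetical names, not HBase's real classes.
class FlushRaceSketch {
    /** Minimal stand-in for a region: a memstore size plus a "being split" flag. */
    static final class Region {
        final String name;
        final long memstoreSize;
        volatile boolean splitting;

        Region(String name, long memstoreSize) {
            this.name = name;
            this.memstoreSize = memstoreSize;
        }
    }

    /** Picks the largest region that is not splitting; may return an empty region. */
    static Region pickRegionToFlush(List<Region> regions) {
        Region best = null;
        for (Region r : regions) {
            if (r.splitting) {
                continue; // assumption: a splitting region is not eligible
            }
            if (best == null || r.memstoreSize > best.memstoreSize) {
                best = r;
            }
        }
        return best;
    }

    /** Same shape as the quoted check: the candidate must exist and be non-empty. */
    static void checkCandidate(Region candidate) {
        if (candidate == null || candidate.memstoreSize <= 0) {
            String name = (candidate == null) ? "none" : candidate.name;
            throw new IllegalStateException("no flushable region, candidate: " + name);
        }
    }

    public static void main(String[] args) {
        Region busy = new Region("splitting-region", 1024L);
        Region idle = new Region("idle-region", 0L);
        List<Region> regions = Arrays.asList(busy, idle);

        busy.splitting = true; // the split starts before the flush loop selects
        try {
            checkCandidate(pickRegionToFlush(regions));
        } catch (IllegalStateException e) {
            System.out.println("check failed: " + e.getMessage());
        }
    }
}
```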

I was thinking of other ways to fix this concurrency issue but ended up picking 
what you see in patch v4. The rationale is that the region being split would 
finish splitting and become eligible for future flushing.
Temporary suspension of flushing would be lifted later.
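
To make the rationale concrete, the sketch below shows one way "skip now, retry 
later" could look. It is not the content of patch v4 or any other attachment; 
the method name tryFlushOne and its parameters are hypothetical, and the sizes 
are passed in directly so the example stays self-contained.

```java
// Illustrative sketch only; not the contents of any attached patch.
class SkipInsteadOfThrowSketch {
    /**
     * Returns true if a region would be flushed, false if no eligible region
     * currently holds data. Hypothetical signature for illustration.
     */
    static boolean tryFlushOne(String regionToFlush, long regionToFlushSize) {
        if (regionToFlush == null || regionToFlushSize <= 0) {
            // Nothing eligible right now, for example because the only non-empty
            // region is mid-split. Report "nothing flushed" instead of throwing,
            // so that a later attempt can pick the region up again.
            return false;
        }
        // A real implementation would flush the region here.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(tryFlushOne("idle-region", 0L));
        System.out.println(tryFlushOne("busy-region", 1024L));
    }
}
```

The caller can treat a false return as a temporary condition and try again on 
its next wake-up, rather than handling an exception.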

I can dig up more of the logs tomorrow - it is late in California.

> Properly handle Preconditions check failure in 
> MemStoreFlusher$FlushHandler.run
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-20090
>                 URL: https://issues.apache.org/jira/browse/HBASE-20090
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>         Attachments: 20090.v1.txt, 20090.v4.txt, 20090.v5.txt
>
>
> Here is the code in branch-2 :
> {code}
>         try {
>           wakeupPending.set(false); // allow someone to wake us up again
>           fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
>           if (fqe == null || fqe instanceof WakeupFlushThread) {
> ...
>               if (!flushOneForGlobalPressure()) {
> ...
>           FlushRegionEntry fre = (FlushRegionEntry) fqe;
>           if (!flushRegion(fre)) {
>             break;
> ...
>         } catch (Exception ex) {
>           LOG.error("Cache flusher failed for entry " + fqe, ex);
>           if (!server.checkFileSystem()) {
>             break;
>           }
>         }
> {code}
> Inside flushOneForGlobalPressure():
> {code}
>       Preconditions.checkState(
>         (regionToFlush != null && regionToFlushSize > 0) ||
>         (bestRegionReplica != null && bestRegionReplicaSize > 0));
> {code}
> When the Preconditions check fails, IllegalStateException is caught by the 
> catch block shown above.
> However, the fqe is not flushed, resulting in potential data loss.

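The effect described in the issue can be sketched as follows: a handler loop 
whose catch block absorbs the IllegalStateException, so every iteration fails 
and nothing is flushed. All names here are hypothetical, and the queue holds 
plain strings instead of real flush entries.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch only; class and method names are hypothetical.
class SwallowedFailureSketch {
    int flushed = 0;
    int failures = 0;

    /** Stand-in for a global-pressure flush whose state check fails. */
    boolean flushOneForGlobalPressure() {
        throw new IllegalStateException("no region with non-zero memstore size");
    }

    /** Runs a bounded number of handler iterations over the queue. */
    void runHandler(Queue<String> flushQueue, int iterations) {
        for (int i = 0; i < iterations; i++) {
            String entry = null;
            try {
                entry = flushQueue.poll(); // null when the queue is empty
                if (entry == null || entry.equals("WAKEUP")) {
                    flushOneForGlobalPressure();
                    flushed++; // never reached while the check keeps failing
                }
            } catch (Exception ex) {
                // The failure is recorded and the loop continues; no flush happened.
                failures++;
            }
        }
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        queue.add("WAKEUP");
        SwallowedFailureSketch handler = new SwallowedFailureSketch();
        handler.runHandler(queue, 3);
        System.out.println("flushed=" + handler.flushed + " failures=" + handler.failures);
    }
}
```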


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
