[ https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385659#comment-16385659 ]

Anoop Sam John commented on HBASE-20090:
----------------------------------------

The thing to check is why a region with a heap size of 0 was selected as the
best region when it was about to be flushed. The selection of the best region to
flush is already based on the region's heap size. There were some recent changes
in this area, such as tracking heap size rather than data size for the flush
decision, but at first look those changes do not seem to cause any new harm. Is
this reproducible? If so, we need DEBUG logs, and please paste the nearby log
lines as well. Could it be that the selected region was about to flush anyway
(because of the per-region flush decision itself), so that by the time the
global heap pressure flush tried to flush it, its size had already become zero?
There is a time gap between the point where the best region is selected and the
point where we assign its heap size to a variable. We can say all this clearly
once we have the nearby logs. So IMO, rather than applying a simple fix, we
should investigate the real reason. In the end this fix may be the one we
actually apply, but before that please do a thorough investigation. I think
this will be a good jira to look at and investigate.
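The interleaving suspected above can be replayed deterministically: the best region is chosen by heap size, a per-region flush that was already in progress completes, and only then is the size read. The sketch below is a minimal single-threaded simulation under that assumption; `FakeRegion`, `SelectionRace`, `selectBiggest` and `sizeSeenByGlobalFlush` are hypothetical stand-ins, not HBase classes, and the real `MemStoreFlusher` code is structured differently.

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a region: only tracks its memstore heap size.
final class FakeRegion {
    final String name;
    final AtomicLong heapSize;

    FakeRegion(String name, long heapSize) {
        this.name = name;
        this.heapSize = new AtomicLong(heapSize);
    }

    // Models a per-region flush that completes and resets the memstore size.
    void flush() {
        heapSize.set(0);
    }
}

final class SelectionRace {
    // Time of check: pick the region with the largest heap size.
    static FakeRegion selectBiggest(List<FakeRegion> regions) {
        return regions.stream()
                .max(Comparator.comparingLong(r -> r.heapSize.get()))
                .orElse(null);
    }

    // Replays the suspected interleaving: select, optionally let a racing
    // per-region flush finish in the gap, then read the size (time of use).
    static long sizeSeenByGlobalFlush(List<FakeRegion> regions, boolean racingFlush) {
        FakeRegion best = selectBiggest(regions);
        if (racingFlush) {
            best.flush(); // happens between selection and the size read
        }
        return best.heapSize.get();
    }

    public static void main(String[] args) {
        List<FakeRegion> regions = List.of(new FakeRegion("a", 64), new FakeRegion("b", 128));
        long size = sizeSeenByGlobalFlush(regions, true);
        // With the racing flush the size read is 0, so a check requiring
        // size > 0 would throw IllegalStateException at this point.
        System.out.println("size seen by global flush = " + size);
    }
}
```

If this is what happens in practice, the DEBUG logs requested above should show a per-region flush of the same region completing just before the global pressure flush reports the failed check.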

> Properly handle Preconditions check failure in MemStoreFlusher$FlushHandler.run
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-20090
>                 URL: https://issues.apache.org/jira/browse/HBASE-20090
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>         Attachments: 20090.v1.txt, 20090.v4.txt, 20090.v5.txt
>
>
> Here is the code in branch-2:
> {code}
>         try {
>           wakeupPending.set(false); // allow someone to wake us up again
>           fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
>           if (fqe == null || fqe instanceof WakeupFlushThread) {
> ...
>               if (!flushOneForGlobalPressure()) {
> ...
>           FlushRegionEntry fre = (FlushRegionEntry) fqe;
>           if (!flushRegion(fre)) {
>             break;
> ...
>         } catch (Exception ex) {
>           LOG.error("Cache flusher failed for entry " + fqe, ex);
>           if (!server.checkFileSystem()) {
>             break;
>           }
>         }
> {code}
> Inside flushOneForGlobalPressure():
> {code}
>       Preconditions.checkState(
>         (regionToFlush != null && regionToFlushSize > 0) ||
>         (bestRegionReplica != null && bestRegionReplicaSize > 0));
> {code}
> When the Preconditions check fails, the IllegalStateException is caught by the
> catch block shown above.
> However, fqe is not flushed, resulting in potential data loss.
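One defensive alternative to letting `Preconditions.checkState` throw is to re-read the size at the time of use and, if the chosen region has already been drained, exclude it and select again. The sketch below assumes that approach; it is not the patch attached to this issue, it does not replace the investigation requested in the comment, and `Candidate`, `GlobalPressureFlush` and this `flushOneForGlobalPressure` are hypothetical simplifications (a `null` result plays the role of the real method returning `false`).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical region stand-in: a name plus a mutable memstore heap size.
final class Candidate {
    final String name;
    final AtomicLong heapSize;

    Candidate(String name, long size) {
        this.name = name;
        this.heapSize = new AtomicLong(size);
    }
}

final class GlobalPressureFlush {
    // Returns the name of the region flushed, or null if nothing is worth flushing.
    // Instead of asserting size > 0 (which would throw IllegalStateException),
    // a region whose size is 0 at time of use is excluded and selection retried.
    static String flushOneForGlobalPressure(List<Candidate> regions) {
        Set<Candidate> excluded = new HashSet<>(); // identity-based: no equals override
        while (true) {
            Candidate best = null;
            for (Candidate c : regions) {
                if (excluded.contains(c)) {
                    continue;
                }
                if (best == null || c.heapSize.get() > best.heapSize.get()) {
                    best = c;
                }
            }
            if (best == null) {
                return null; // every candidate was excluded: nothing to flush
            }
            long size = best.heapSize.get(); // re-read at time of use
            if (size <= 0) {
                excluded.add(best); // already drained elsewhere; try the next one
                continue;
            }
            best.heapSize.set(0); // stand-in for actually flushing the region
            return best.name;
        }
    }

    public static void main(String[] args) {
        List<Candidate> regions = List.of(new Candidate("a", 16), new Candidate("b", 32));
        System.out.println(flushOneForGlobalPressure(regions)); // b
        System.out.println(flushOneForGlobalPressure(regions)); // a
        System.out.println(flushOneForGlobalPressure(regions)); // null: nothing left
    }
}
```

With this shape a drained region leads the handler loop into its existing "nothing flushed" branch rather than the generic catch block; whether that is the correct outcome depends on what the logs show.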



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
