[ https://issues.apache.org/jira/browse/HBASE-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285082#comment-15285082 ]

Josh Elser commented on HBASE-15837:
------------------------------------

bq. Crashing when holding data that's unexpected seems like the correct thing 
to do

Without looking at the code, I would have agreed with you; however, after 
taking a look at how it's written, I think it's just bad accounting. The check 
is written to verify that the flush we tried to run after grabbing the 
writeLock actually ran successfully (i.e. there should be no chance that any 
more data exists). Using {{memstoreSize}} to judge whether or not to actually 
run the flush, but then checking the size of each Store, seems goofy as well: 
it gives us two sources of truth that can disagree.
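To make that split concrete, here's a minimal sketch of the two checks, 
assuming simplified stand-ins ({{Store}}, {{memstoreSize}}, {{flush()}}) for 
the HRegion internals rather than the actual code:

{code:java}
// A minimal sketch, assuming simplified stand-ins for the HRegion
// internals; this is not the real code.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class AccountingSplitSketch {
  static class Store {
    long memStoreSize;
  }

  // Region-level counter: the flush decision is keyed off this value.
  private final AtomicLong memstoreSize = new AtomicLong();
  private final Map<String, Store> stores = new HashMap<>();

  void close() {
    // If a bug has driven the counter to <= 0, the flush short-circuits...
    if (memstoreSize.get() > 0) {
      flush();
    }
    // ...but the sanity check walks the per-Store sizes, which can still
    // hold unflushed data. That disagreement is what aborts the RS today.
    for (Store store : stores.values()) {
      if (store.memStoreSize > 0) {
        throw new IllegalStateException(
            "Memstore size is " + store.memStoreSize + ", expected 0");
      }
    }
  }

  private void flush() {
    // elided: write each Store's memstore out and reset the counter
  }
}
{code}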

Given that coprocessors could be loaded which unintentionally mess things up 
(not to mention internal bugs), forcing down the RS seems very invasive to me. 
I'm attaching a patch once I finish typing this -- let me know what you think. 
This feels pretty safe to me, given that we know we're controlling all access 
to the region at this point in time.
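Roughly, the patch's intent looks like the following (a sketch only, reusing 
the toy class from the previous snippet; the real change lives in the region 
close/flush path):

{code:java}
// Sketch only: instead of trusting a negative counter, treat any non-zero
// value as "we might have data" and force the flush before the sanity check.
void closePatched() {
  if (memstoreSize.get() != 0) {
    // Zero is the only value we can trust to mean "nothing to flush".
    // A negative size means the accounting is broken, so flush
    // defensively rather than aborting the RS later.
    flush();
  }
  for (Store store : stores.values()) {
    if (store.memStoreSize > 0) {
      // Still non-empty after a forced flush: a genuine bug worth
      // surfacing, but the negative-counter case no longer kills the RS.
      throw new IllegalStateException("Unflushed data after forced flush");
    }
  }
}
{code}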

> More gracefully handle a negative memstoreSize
> ----------------------------------------------
>
>                 Key: HBASE-15837
>                 URL: https://issues.apache.org/jira/browse/HBASE-15837
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 2.0.0
>
>
> Over in PHOENIX-2883, I've been trying to figure out how to track down the 
> root cause of an issue we were seeing where a negative memstoreSize was 
> ultimately causing an RS to abort. The tl;dr version is
> * Something causes memstoreSize to be negative (not sure what is doing this 
> yet)
> * All subsequent flushes short-circuit and don't run because they think there 
> is no data to flush
> * The region is eventually closed (commonly, for a move).
> * A final flush is attempted on each store before closing (and also 
> short-circuits for the same reason), leaving unflushed data in each store.
> * The sanity check that each store's size is zero fails and the RS aborts.
> I have a little patch which I think should improve our handling of this 
> failure case, safely preventing the RS abort (by forcing a flush when 
> memstoreSize is negative) and logging a call trace when an update to 
> memstoreSize makes it negative (to find culprits in the future).
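For the "log a calltrace" half described above, the idea is roughly the 
sketch below, assuming a single choke point through which all 
{{memstoreSize}} updates flow; {{addAndGetMemstoreSize}} is an illustrative 
name and slf4j is used only for the example:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MemstoreAccounting {
  private static final Logger LOG =
      LoggerFactory.getLogger(MemstoreAccounting.class);

  private final AtomicLong memstoreSize = new AtomicLong();

  long addAndGetMemstoreSize(long delta) {
    long updated = memstoreSize.addAndGet(delta);
    if (updated < 0) {
      // Record who drove the counter negative; the Exception is only a
      // vehicle for the stack trace and is never thrown.
      LOG.warn("memstoreSize went negative (" + updated + "), delta=" + delta,
          new Exception("Stack trace of the offending update"));
    }
    return updated;
  }
}
{code}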


