[jira] [Commented] (PHOENIX-2883) Region close during automatic disabling of index for rebuilding can lead to RS abort

Josh Elser (JIRA) Wed, 11 May 2016 08:40:47 -0700

    [ https://issues.apache.org/jira/browse/PHOENIX-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280299#comment-15280299 ]


Josh Elser commented on PHOENIX-2883:
-------------------------------------

Some more updates. I spent some time with [~devaraj] yesterday talking over 
this one. I believe we are both in agreement that we get into the following 
(condensed) scenario (we've seen this across a few regions in a cluster):

* The {{Indexer}}'s {{preBatchMutate}} method ends up throwing an 
{{Index update failed}} error because the server cache is missing (it was 
likely evicted after timing out, given how backed up the system was).
* The next flush for that region reports that the final memstoreSize is 
negative.
* All subsequent attempts to flush the region never run, because a sanity 
check first verifies that the region has data to flush (by checking that 
{{memstoreSize > 0}}) and a negative size fails that check.
* A region move request is eventually received and the RS attempts to close 
the region.
* The final attempt to flush is called and not run (just like the previous 
cases)
* The sanity check that each store's memstore is empty (verifying that the 
flushes ran) fails.
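
To make the sequence above concrete, here is a toy model of it. This is only a sketch and not HBase code: {{RegionModel}}, {{put}}, {{faultyRollback}}, {{flush}} and {{close}} are hypothetical names, and the faulty rollback is purely an assumption standing in for whatever actually corrupts the size (which we have not identified).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of one region with a single store. Hypothetical; not HBase code. */
class RegionModel {
    // Region-level accounting, analogous to memstoreSize in the log message.
    private final AtomicLong memstoreSize = new AtomicLong();
    // Bytes actually buffered in the store's memstore.
    private long flushableSize = 0;

    /** A normal write: buffered data and accounting grow together. */
    void put(long bytes) {
        flushableSize += bytes;
        memstoreSize.addAndGet(bytes);
    }

    /** ASSUMPTION: some failure path subtracts bytes it never added. */
    void faultyRollback(long bytes) {
        memstoreSize.addAndGet(-bytes);
    }

    long memstoreSize() {
        return memstoreSize.get();
    }

    /** Returns true only if the flush actually ran. */
    boolean flush() {
        if (memstoreSize.get() <= 0) {
            return false; // sanity check decides there is nothing to flush
        }
        memstoreSize.addAndGet(-flushableSize);
        flushableSize = 0;
        return true;
    }

    /** Final flush attempt, then verify that the store is empty. */
    void close() {
        flush();
        if (flushableSize != 0) {
            throw new IllegalStateException("flushableSize expected=0, actual="
                + flushableSize + ", memstoreSize=" + memstoreSize.get());
        }
    }

    public static void main(String[] args) {
        RegionModel region = new RegionModel();
        region.put(100);            // real data lands in the memstore
        region.faultyRollback(300); // accounting goes negative
        System.out.println("memstoreSize = " + region.memstoreSize());
        System.out.println("flush ran = " + region.flush());
        try {
            region.close();
        } catch (IllegalStateException e) {
            System.out.println("close failed: " + e.getMessage());
        }
    }
}
```

In this model, once the counter is at or below zero every flush is skipped, so the buffered bytes can never drain and the close-time check is guaranteed to fail.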

At this point, we haven't been able to figure out how the Region's memstore 
gets screwed up, but I have a patch I can put into HBase to more gracefully 
handle this scenario (not to mention catch any culprits that obviously screw up 
the memstore size).

> Region close during automatic disabling of index for rebuilding can lead to 
> RS abort
> ------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-2883
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2883
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>
> (disclaimer: still performing due-diligence on this one)
> I've been helping a user this week with what is thought to be a race 
> condition in secondary index updates. This user has a relatively heavy 
> write-based workload with a few tables that each have at least one index.
> What we have seen is that when the region distribution is changing 
> (concretely, we were doing a rolling restart of the cluster without the load 
> balancer disabled in the hopes of retaining as much availability as 
> possible), I've seen the following general outline in the logs:
> * An index update fails (due to {{ERROR 2008 (INT10)}}: the index metadata 
> cache expired or is just missing)
> * The index is taken offline to be asynchronously rebuilt
> * A flush on the data table's region is queued for quite some time
> * RS is asked to close a region (due to a move, commonly)
> * RS aborts because the memstore for the data table's region is in an 
> inconsistent state (e.g. {{Assertion failed while closing store <region> 
> <colfam> flushableSize expected=0, actual= 193392. Current 
> memstoreSize=-552208. Maybe a coprocessor operation failed and left the 
> memstore in a partially updated state.}})
> Some relevant HBase issues include HBASE-10514 and HBASE-10844.
> Have been talking to [~ayingshu] and [~devaraj] about it, but haven't found 
> anything definitively conclusive yet. Will dump findings here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
