Josh Elser created PHOENIX-2883:
-----------------------------------

             Summary: Region close during automatic disabling of index for 
rebuilding can lead to RS abort
                 Key: PHOENIX-2883
                 URL: https://issues.apache.org/jira/browse/PHOENIX-2883
             Project: Phoenix
          Issue Type: Bug
            Reporter: Josh Elser
            Assignee: Josh Elser


(disclaimer: still performing due-diligence on this one)

I've been helping a user this week with what is thought to be a race condition 
in secondary index updates. This user has a relatively heavy write-based 
workload with a few tables that each have at least one index.

What we have seen is that when the region distribution is changing (concretely, 
we were doing a rolling restart of the cluster without the load balancer 
disabled in the hopes of retaining as much availability as possible), the logs 
show the following general outline:

* An index update fails (due to {{ERROR 2008 (INT10)}}: the index metadata cache 
expired or is just missing)
* The index is taken offline to be asynchronously rebuilt
* A flush on the data table's region is queued for quite some time
* RS is asked to close a region (due to a move, commonly)
* RS aborts because the memstore for the data table's region is in an 
inconsistent state (e.g. {{Assertion failed while closing store <region> 
<colfam> flushableSize expected=0, actual= 193392. Current 
memstoreSize=-552208. Maybe a coprocessor operation failed and left the 
memstore in a partially updated state.}})

Some relevant HBase issues include HBASE-10514 and HBASE-10844.

Have been talking to [~ayingshu] and [~devaraj] about it, but haven't found 
anything conclusive yet. Will dump findings here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
