[ https://issues.apache.org/jira/browse/PHOENIX-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275737#comment-15275737 ]
Josh Elser commented on PHOENIX-2883:
-------------------------------------

Hi [~giacomotaylor]. Thanks for reaching out.

bq. FYI, the index rebuild process is attempted every 10 seconds

Yup, I've stumbled across that in the docs already. The initial hunch was that if the region was closed sometime in that 10s, we'd hit this inconsistency.

bq. Is this for 4.7 or something else (as there were some changes as of 4.7).

Sadly, it's against a 4.4 and change (a vendor creation). I've intentionally omitted the affectsVersion until we have a better understanding of what it actually affects (and that it's not just something said vendor creation missed from upstream).

I'll try to put up info as I investigate this further in case anyone else has a eureka moment before I do :)

> Region close during automatic disabling of index for rebuilding can lead to
> RS abort
> ------------------------------------------------------------------------------------
>
> Key: PHOENIX-2883
> URL: https://issues.apache.org/jira/browse/PHOENIX-2883
> Project: Phoenix
> Issue Type: Bug
> Reporter: Josh Elser
> Assignee: Josh Elser
>
> (disclaimer: still performing due-diligence on this one)
> I've been helping a user this week with what is thought to be a race
> condition in secondary index updates. This user has a relatively heavy
> write-based workload with a few tables that each have at least one index.
> When the region distribution is changing (concretely, we were doing a
> rolling restart of the cluster without the load balancer disabled in the
> hopes of retaining as much availability as possible), I've seen the
> following general outline in the logs:
> * An index update fails (due to {{ERROR 2008 (INT10)}}: the index metadata
> cache expired or is just missing)
> * The index is taken offline to be asynchronously rebuilt
> * A flush on the data table's region is queued for quite some time
> * RS is asked to close a region (due to a move, commonly)
> * RS aborts because the memstore for the data table's region is in an
> inconsistent state (e.g. {{Assertion failed while closing store <region>
> <colfam> flushableSize expected=0, actual= 193392. Current
> memstoreSize=-552208. Maybe a coprocessor operation failed and left the
> memstore in a partially updated state.}})
> Some relevant HBase issues include HBASE-10514 and HBASE-10844.
> Have been talking to [~ayingshu] and [~devaraj] about it, but haven't found
> anything conclusive yet. Will dump findings here.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)