[ 
https://issues.apache.org/jira/browse/PHOENIX-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274836#comment-15274836
 ] 

Josh Elser commented on PHOENIX-2883:
-------------------------------------

However, a competing theory I'm developing is related to flushes that aren't 
being run (for some reason). Originally there was also an 'Index update 
failed' error message, but sometimes the last occurrence of that message 
precedes the abort by hours, so it may be a red herring.

Grep'ing the RS log down to just the messages containing the region 
identifier, we can see that the server requests a flush for the region every 
10s until, eventually, the region is closed:

{noformat}
2016-04-06 12:53:21,639 INFO  [<servername>_ChoreService_1] 
regionserver.HRegionServer: <servername>-MemstoreFlusherChore requesting flush 
for region <regionname> after a delay of 3494
2016-04-06 12:53:25,133 DEBUG [MemStoreFlusher.0] regionserver.HRegion: Column 
Family: A of region <regionname> will be flushed because time of its oldest 
edit (1459955437621) is far away from now(1459965205133), max allowed is 3600000
2016-04-06 12:53:31,639 INFO  [<servername>_ChoreService_1] 
regionserver.HRegionServer: <servername>-MemstoreFlusherChore requesting flush 
for region <regionname> after a delay of 19641
2016-04-06 12:53:41,639 INFO  [<servername>_ChoreService_1] 
regionserver.HRegionServer: <servername>-MemstoreFlusherChore requesting flush 
for region <regionname> after a delay of 19136
2016-04-06 12:53:51,282 DEBUG [MemStoreFlusher.0] regionserver.HRegion: Column 
Family: A of region <regionname> will be flushed because time of its oldest 
edit (1459955437621) is far away from now(1459965231282), max allowed is 3600000
2016-04-06 12:53:51,639 INFO  [<servername>_ChoreService_1] 
regionserver.HRegionServer: <servername>-MemstoreFlusherChore requesting flush 
for region <regionname> after a delay of 20967
2016-04-06 12:54:01,640 INFO  [<servername>_ChoreService_1] 
regionserver.HRegionServer: <servername>-MemstoreFlusherChore requesting flush 
for region <regionname> after a delay of 16080
2016-04-06 12:54:11,640 INFO  [<servername>_ChoreService_1] 
regionserver.HRegionServer: <servername>-MemstoreFlusherChore requesting flush 
for region <regionname> after a delay of 18286
2016-04-06 12:54:12,606 DEBUG [MemStoreFlusher.1] regionserver.HRegion: Column 
Family: A of region <regionname> will be flushed because time of its oldest 
edit (1459955437621) is far away from now(1459965252606), max allowed is 3600000
2016-04-06 12:54:21,640 INFO  [<servername>_ChoreService_1] 
regionserver.HRegionServer: <servername>-MemstoreFlusherChore requesting flush 
for region <regionname> after a delay of 10057
2016-04-06 12:54:33,039 DEBUG [RS_CLOSE_REGION-<server>-1] 
handler.CloseRegionHandler: Processing close of <regionname>
2016-04-06 12:54:33,040 DEBUG [RS_CLOSE_REGION-<server>-1] 
regionserver.HRegion: Closing <regionname>: disabling compactions & flushes
2016-04-06 12:54:33,040 DEBUG [RS_CLOSE_REGION-<server>-1] 
regionserver.HRegion: Updates disabled for region <regionname>
2016-04-06 12:54:33,040 FATAL [RS_CLOSE_REGION-<server>-1] 
regionserver.HRegionServer: ABORTING region server <servername>: Assertion 
failed while closing store <regionname> A. flushableSize expected=0, actual= 
87384. Current memstoreSize=-24024. Maybe a coprocessor operation failed and 
left the memstore in a partially updated state.
2016-04-06 12:54:33,106 INFO  [StoreCloserThread-<regionname>-1] 
regionserver.HStore: Closed A
2016-04-06 12:54:33,117 DEBUG [RS_CLOSE_REGION-<server>-1] wal.WALSplitter: 
Wrote region seqId=hdfs://<path>/recovered.edits/442148.seqid to file, 
newSeqId=442148, maxSeqId=8
2016-04-06 12:54:33,119 INFO  [RS_CLOSE_REGION-<server>-1] 
regionserver.HRegion: Closed <regionname>
2016-04-06 12:54:33,120 DEBUG [RS_CLOSE_REGION-<server>-1] 
handler.CloseRegionHandler: Closed <regionname>
{noformat}
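
For what it's worth, the DEBUG lines above are HBase's periodic, time-based 
flush check firing: the oldest edit in the memstore (1459955437621) is about 
9,767,512 ms, roughly 2.7 hours, older than "now" (1459965205133), well past 
the 3600000 ms limit, yet the chore keeps re-requesting a flush that 
apparently never completes. A minimal sketch of that check (illustrative 
names, not HBase's actual API; the limit corresponds to 
hbase.regionserver.optionalcacheflushinterval):
{noformat}
// Minimal sketch, not HBase source: approximates the time-based flush check
// behind the "time of its oldest edit ... is far away from now" DEBUG lines.
// Class and method names here are illustrative assumptions.
public class PeriodicFlushCheck {

    // Matches the "max allowed is 3600000" in the log; in HBase this limit
    // comes from hbase.regionserver.optionalcacheflushinterval (1h default).
    static final long MAX_EDIT_AGE_MS = 3_600_000L;

    static boolean shouldFlush(long oldestEditTs, long now) {
        // Flush if the oldest unflushed edit is older than the allowed interval.
        return (now - oldestEditTs) > MAX_EDIT_AGE_MS;
    }

    public static void main(String[] args) {
        long oldestEdit = 1459955437621L; // oldest-edit timestamp from the log
        long now = 1459965205133L;        // "now" from the same log line
        long ageMs = now - oldestEdit;    // 9,767,512 ms, roughly 2.7 hours
        System.out.printf("age=%d ms, shouldFlush=%b%n",
                ageMs, shouldFlush(oldestEdit, now));
    }
}
{noformat}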

Perhaps it is still related to a coproc doing something that leaves the 
memstore in a bad state, just not related to secondary indexing? Not sure...
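
On the abort itself: the FATAL message says store A still has 87384 flushable 
bytes while the region-level memstoreSize counter reads -24024. That counter 
can only go negative if size was subtracted that was never (successfully) 
added, which fits a coprocessor hook failing partway through a batch. A tiny 
hypothetical illustration of the accounting invariant the close path checks 
(not HBase code; the numbers are just plugged in from the log above):
{noformat}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration, not HBase source: the region-level memstore-size
// counter should reach 0 once updates are disabled and everything is flushed.
// A failure path that subtracts more than it added (e.g. a coprocessor
// aborting mid-batch with an over-correcting rollback) drives the counter
// negative while the store still holds unflushed cells, which is what the
// close-time check trips on.
class MemstoreAccountingSketch {

    private final AtomicLong regionMemstoreSize = new AtomicLong();

    void applyBatch(long batchBytes, boolean coprocFails) {
        regionMemstoreSize.addAndGet(batchBytes);          // account new edits
        if (coprocFails) {
            // Buggy rollback: removes more bytes from the counter than the
            // failed operation actually contributed.
            regionMemstoreSize.addAndGet(-(batchBytes + 24_024));
        }
    }

    void checkOnClose(long flushableBytes) {
        // Mirrors the spirit of the FATAL message in the log above.
        if (flushableBytes != 0) {
            throw new AssertionError("flushableSize expected=0, actual="
                    + flushableBytes + ". Current memstoreSize="
                    + regionMemstoreSize.get());
        }
    }

    public static void main(String[] args) {
        MemstoreAccountingSketch store = new MemstoreAccountingSketch();
        store.applyBatch(87_384, true); // numbers borrowed from the abort message
        store.checkOnClose(87_384);     // 87384 bytes still in the store -> abort
    }
}
{noformat}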

> Region close during automatic disabling of index for rebuilding can lead to 
> RS abort
> ------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-2883
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2883
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>
> (disclaimer: still performing due-diligence on this one)
> I've been helping a user this week with what is thought to be a race 
> condition in secondary index updates. This user has a relatively heavy 
> write-based workload with a few tables that each have at least one index.
> When the region distribution is changing (concretely, we were doing a 
> rolling restart of the cluster without disabling the load balancer, in the 
> hopes of retaining as much availability as possible), I've seen the 
> following general outline in the logs:
> * An index update fails (due to {{ERROR 2008 (INT10)}}: the index metadata 
> cache expired or is just missing)
> * The index is taken offline to be asynchronously rebuilt
> * A flush on the data table's region is queued for quite some time
> * RS is asked to close a region (due to a move, commonly)
> * RS aborts because the memstore for the data table's region is in an 
> inconsistent state (e.g. {{Assertion failed while closing store <region> 
> <colfam> flushableSize expected=0, actual= 193392. Current 
> memstoreSize=-552208. Maybe a coprocessor operation failed and left the 
> memstore in a partially updated state.}})
> Some relevant HBase issues include HBASE-10514 and HBASE-10844.
> Have been talking to [~ayingshu] and [~devaraj] about it, but haven't found 
> anything conclusive yet. Will dump findings here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
