[jira] [Commented] (PHOENIX-2883) Region close during automatic disabling of index for rebuilding can lead to RS abort

Josh Elser (JIRA) Mon, 09 May 2016 15:43:47 -0700

    [ https://issues.apache.org/jira/browse/PHOENIX-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277222#comment-15277222 ]


Josh Elser commented on PHOENIX-2883:
-------------------------------------

I've been looking at this some more today, and I'm thinking that it might be 
purely a recording issue at this point (at least, I have no reason to believe 
that the memstore was actually in a bad state).

{noformat}
2016-05-04 08:11:53,615 INFO  [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for <regionname>, current region memstore size 1.48 MB, and 1/2 column families' memstores are being flushed.
2016-05-04 08:11:53,615 INFO  [MemStoreFlusher.0] regionserver.HRegion: Flushing Column Family: A which was occupying 2.93 MB of memstore.
2016-05-04 08:11:53,665 INFO  [MemStoreFlusher.0] regionserver.DefaultStoreFlusher: Flushed, sequenceid=679546395, memsize=2.9 M, hasBloomFilter=true, into tmp file hdfs://path-to-region/.tmp/831a68bcc7a94bcbae824280b1583415
2016-05-04 08:11:53,682 DEBUG [MemStoreFlusher.0] regionserver.HRegionFileSystem: Committing store file hdfs://path-to-region/.tmp/831a68bcc7a94bcbae824280b1583415 as hdfs://path-to-region/A/831a68bcc7a94bcbae824280b1583415
2016-05-04 08:11:53,697 INFO  [MemStoreFlusher.0] regionserver.HStore: Added hdfs://path-to-region/A/831a68bcc7a94bcbae824280b1583415, entries=10864, sequenceid=679546395, filesize=89.7 K
2016-05-04 08:11:53,699 INFO  [MemStoreFlusher.0] regionserver.HRegion: Finished memstore flush of ~2.93 MB/3071056, currentsize=-1.45 MB/-1521720 for region <regionname> in 85ms, sequenceid=679546395, compaction requested=true
{noformat}

Somehow, HBase is telling us that the memstore size of a single column family 
(2.93 MB) is nearly twice as large as the memstore for the entire region 
(1.48 MB). The final memstore size for the region certainly comes from those 
two numbers: 1.48 MB - 2.93 MB = -1.45 MB. One of those two sizes got messed 
up for sure (either the region's memstore size decreased or the CF's memstore 
size increased when it shouldn't have). Will try to dig into that one next.
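
To make the arithmetic concrete, here is a minimal sketch that reproduces the numbers from the log above. It is a simplified model of the bookkeeping, not HBase's actual code: the class {{MemstoreAccounting}} and both of its methods are hypothetical names made up for illustration. The only inputs taken from the log are the flushed size (3071056 bytes) and the resulting region size (-1521720 bytes); the pre-flush region size is derived from them and works out to 1549336 bytes, which matches the reported 1.48 MB.

```java
// Illustrative only: a simplified model of the size bookkeeping implied by
// the flush log. The names here are hypothetical; this is NOT HBase code.
public class MemstoreAccounting {

    // Region memstore size after a flush, assuming the region counter is
    // simply decremented by the number of bytes flushed.
    public static long sizeAfterFlush(long regionSizeBytes, long flushedBytes) {
        return regionSizeBytes - flushedBytes;
    }

    // Sanity check: one column family's memstore should never be larger
    // than the memstore total of the region that contains it.
    public static boolean isConsistent(long regionSizeBytes, long familySizeBytes) {
        return familySizeBytes >= 0 && familySizeBytes <= regionSizeBytes;
    }

    public static void main(String[] args) {
        long flushed = 3071056L;        // "~2.93 MB/3071056" in the log
        long after = -1521720L;         // "currentsize=-1.45 MB/-1521720" in the log
        long before = after + flushed;  // implied region size before the flush

        System.out.println("region size before flush = " + before + " bytes");
        System.out.println("region size after flush  = "
                + sizeAfterFlush(before, flushed) + " bytes");
        System.out.println("sizes consistent         = "
                + isConsistent(before, flushed));
    }
}
```

Under this model the negative final size follows directly from subtracting a CF size that is larger than the region total, so the sketch cannot say which of the two counters was wrong, only that they disagree.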

> Region close during automatic disabling of index for rebuilding can lead to 
> RS abort
> ------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-2883
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2883
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>
> (disclaimer: still performing due-diligence on this one)
> I've been helping a user this week with what is thought to be a race 
> condition in secondary index updates. This user has a relatively heavy 
> write-based workload with a few tables that each have at least one index.
> When the region distribution is changing (concretely, we were doing a 
> rolling restart of the cluster without the load balancer disabled, in the 
> hopes of retaining as much availability as possible), I've seen the 
> following general outline in the logs:
> * An index update fails (due to {{ERROR 2008 (INT10)}}: the index metadata 
> cache expired or is just missing)
> * The index is taken offline to be asynchronously rebuilt
> * A flush on the data table's region is queued for quite some time
> * RS is asked to close a region (due to a move, commonly)
> * RS aborts because the memstore for the data table's region is in an 
> inconsistent state (e.g. {{Assertion failed while closing store <region> 
> <colfam> flushableSize expected=0, actual= 193392. Current 
> memstoreSize=-552208. Maybe a coprocessor operation failed and left the 
> memstore in a partially updated state.}})
> Some relevant HBase issues include HBASE-10514 and HBASE-10844.
> Have been talking to [~ayingshu] and [~devaraj] about it, but haven't found 
> anything conclusive yet. Will dump findings here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
