[ https://issues.apache.org/jira/browse/HBASE-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Elser updated HBASE-15837:
-------------------------------
    Attachment: HBASE-15837.001.patch

.001 A first stab at avoiding the RS crash. The general goals are to:
# Determine who screwed up the memstoreSize in the first place
# Avoid data loss when memstoreSize is wrong

If a store does fail to flush successfully, the RS should still crash. The patch just fixes the logic so that a negative memstoreSize no longer short-circuits a Store's flush and causes the RS to abort.

> More gracefully handle a negative memstoreSize
> ----------------------------------------------
>
>                 Key: HBASE-15837
>                 URL: https://issues.apache.org/jira/browse/HBASE-15837
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 2.0.0
>
>         Attachments: HBASE-15837.001.patch
>
>
> Over in PHOENIX-2883, I've been trying to figure out how to track down the
> root cause of an issue we were seeing where a negative memstoreSize was
> ultimately causing an RS to abort. The tl;dr version is:
> * Something causes memstoreSize to become negative (not yet clear what)
> * All subsequent flushes short-circuit and don't run because they think there
> is no data to flush
> * The region is eventually closed (commonly, for a move)
> * A final flush is attempted on each store before closing (which also
> short-circuits for the same reason), leaving unflushed data in each store
> * The sanity check that each store's size is zero fails and the RS aborts
> I have a small patch which I think should improve our failure handling here,
> safely preventing the RS abort (by forcing a flush when memstoreSize is
> negative) and logging a call trace when an update to memstoreSize makes it
> negative (to find culprits in the future).
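To make the intent concrete, here is a minimal, self-contained Java sketch of the two behaviors the patch aims for; the class and method names are hypothetical stand-ins, not HBase's actual internals:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the patch's intent; not HBase code.
public class MemstoreSizeTracker {
  private final AtomicLong memstoreSize = new AtomicLong();

  // Apply a delta to the accounting, logging a call trace the moment
  // an update drives the size negative so the culprit can be found.
  public long addAndGet(long delta) {
    long newSize = memstoreSize.addAndGet(delta);
    if (newSize < 0) {
      new Exception("memstoreSize went negative: " + newSize)
          .printStackTrace();
    }
    return newSize;
  }

  // A negative size means the accounting is broken, not that the
  // memstore is empty, so force the flush instead of skipping it.
  public boolean shouldFlush() {
    return memstoreSize.get() != 0;
  }

  public static void main(String[] args) {
    MemstoreSizeTracker tracker = new MemstoreSizeTracker();
    tracker.addAndGet(100);   // edits arrive
    tracker.addAndGet(-150);  // buggy decrement; logs a call trace
    System.out.println("flush? " + tracker.shouldFlush()); // true, not skipped
  }
}
{code}

The key design point is that a negative size is treated as corrupt accounting rather than an empty memstore, so the flush on region close still runs, the data reaches disk, and the zero-size sanity check no longer triggers an abort.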