[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Kannan Muthukkaruppan (JIRA) Wed, 26 Jan 2011 01:03:09 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986900#action_12986900
 ]


Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

Maybe the quickest fix is to avoid the "skip" optimization when replaying 
recovered.edits.

I think this should restore correctness.
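
To make the proposal concrete, here is a minimal sketch of the replay decision. It is not the actual HRegion code: the class name, method name, and the skipEnabled flag are made up for illustration. It only shows how an inflated MAX_SEQ_ID of 16 causes edits 11..15 to be dropped, and how disabling the skip replays them.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplaySketch {
    /**
     * Returns the sequence ids that replay would apply to the memstore.
     * Hypothetical model of the replay loop, not HBase's real implementation.
     */
    public static List<Long> replay(long[] recoveredEditSeqIds,
                                    long maxSeqIdInStoreFiles,
                                    boolean skipEnabled) {
        List<Long> applied = new ArrayList<>();
        for (long seqId : recoveredEditSeqIds) {
            // The "skip" optimization: assume every edit at or below the
            // store files' max seq id has already been flushed.
            if (skipEnabled && seqId <= maxSeqIdInStoreFiles) {
                continue;
            }
            applied.add(seqId);
        }
        return applied;
    }

    public static void main(String[] args) {
        long[] recovered = {11, 12, 13, 14, 15};
        // Store file was stamped with 16 although it only holds edits 1..10.
        System.out.println(replay(recovered, 16, true));   // prints []
        System.out.println(replay(recovered, 16, false));  // prints [11, 12, 13, 14, 15]
    }
}
```

The cost of dropping the skip is extra replay work for edits that really were flushed.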

As for HLog reclamation (i.e. an HLog should only be reclaimed if it contains 
no data for an active memstore), I don't think it relies on the MAX_SEQ_ID 
inside store files, but rather on a separate mechanism that tracks the minimum 
edit contained in each memstore. So that case is probably OK.
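
Roughly, the reclamation check I have in mind looks like the sketch below. All names here are hypothetical and the bookkeeping is simplified (for example, it ignores edits that arrive while a flush is in progress); it is only meant to show that the decision depends on the oldest unflushed edit per memstore and never consults MAX_SEQ_ID.

```java
import java.util.HashMap;
import java.util.Map;

public class LogReclaimSketch {
    // Hypothetical bookkeeping: region name -> seq id of the oldest edit
    // that is still only in that region's memstore.
    private final Map<String, Long> oldestUnflushedSeqId = new HashMap<>();

    /** Record an append; only the first edit since the last flush matters. */
    public void onAppend(String region, long seqId) {
        oldestUnflushedSeqId.putIfAbsent(region, seqId);
    }

    /** Simplification: a completed flush empties the region's memstore. */
    public void onFlushComplete(String region) {
        oldestUnflushedSeqId.remove(region);
    }

    /** A log file is reclaimable if no memstore holds an edit it contains. */
    public boolean canReclaim(long maxSeqIdInLogFile) {
        for (long oldest : oldestUnflushedSeqId.values()) {
            if (oldest <= maxSeqIdInLogFile) {
                return false;
            }
        }
        return true;
    }
}
```

For example, after onAppend("r1", 5), a log file whose last edit is 7 cannot be reclaimed, while one whose last edit is 4 can.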

> max seq id in flushed file can be larger than its correct value causing data 
> loss during recovery
> -------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3481
>                 URL: https://issues.apache.org/jira/browse/HBASE-3481
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Priority: Critical
>
> [While doing some cluster kill tests, I noticed some missing data after log 
> recovery. Upon investigating further, and pretty printing contents of HFiles 
> and recovered logs, this is my analysis of the situation/bug. Please confirm 
> the theory and pitch in with suggestions.]
> When memstores are flushed, the max sequence id recorded in the HFile should 
> be the max sequence id of all KVs in the memstore. However, we seem to 
> simply obtain the current sequence id from the HRegion, and stamp the HFile's 
> MAX_SEQ_ID with it.
> From HRegion.java:
> {code}
>     sequenceId = (wal == null)? myseqid: wal.startCacheFlush();
> {code}
> where, startCacheFlush() is:
> {code}
> public long startCacheFlush() {
>   this.cacheFlushLock.lock();
>   return obtainSeqNum();
> }
> {code}
> where, obtainSeqNum() is simply: 
> {code}
> private long obtainSeqNum() {
>   return this.logSeqNum.incrementAndGet();
> }
> {code}
> So let's say a memstore contains edits with sequence number 1..10.
> Meanwhile, say more Puts come along, and are going through this flow (in 
> pseudo-code)
> {code}
>    1. HLog.append()
>        1.1 obtainSeqNum()
>        1.2 writeToWAL()
>    2. updateMemStore()
> {code}
> So it is possible that the sequence number has already been incremented to 
> say 15 (if there are 5 more outstanding puts), while their writeToWAL() is 
> still in progress. In this case, none of these edits (11..15) would have been 
> written to memstore yet.
> At this point if a cache flush of the memstore happens, then we'll record its 
> MAX_SEQ_ID as 16 instead of 10 (because that's what obtainSeqNum() would 
> return as the next sequence number to use, right?).
> Assume that the edits 11..15 eventually complete. And so HLogs do contain the 
> data for edits 11..15.
> Now, at this point if the region server were to crash, and we run log 
> recovery, the splits all go through correctly, and a correct recovered.edits 
> file is generated with the edits 11..15. 
> Next, when the region is opened, the HRegion notes that one of the store files 
> says MAX_SEQ_ID is 16. So, when it replays the recovered.edits file, it 
> skips replaying edits 11..15. Or in other words, data loss.
> ----
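
The interleaving in the description can be walked through deterministically in a few lines. This is a single-threaded sketch, not a reproduction against a real cluster; it assumes obtainSeqNum() behaves like AtomicLong.incrementAndGet(), and the class and variable names are made up.

```java
import java.util.concurrent.atomic.AtomicLong;

public class FlushStampSketch {
    /** Returns {stamped MAX_SEQ_ID, max seq id actually in the memstore}. */
    public static long[] simulate() {
        // Assumption: the log's sequence counter is a simple atomic counter.
        AtomicLong logSeqNum = new AtomicLong(0);
        long maxSeqIdInMemstore = 0;

        // Edits 1..10: seq num obtained, WAL written, memstore updated.
        for (int i = 0; i < 10; i++) {
            maxSeqIdInMemstore = logSeqNum.incrementAndGet();
        }

        // Edits 11..15: seq num obtained, but writeToWAL() is still in
        // progress, so the memstore has not seen them yet.
        for (int i = 0; i < 5; i++) {
            logSeqNum.incrementAndGet();
        }

        // The flush starts now and stamps the file with a fresh seq num.
        long stamped = logSeqNum.incrementAndGet();
        return new long[] {stamped, maxSeqIdInMemstore};
    }

    public static void main(String[] args) {
        long[] result = simulate();
        // prints "stamped=16 correct=10"
        System.out.println("stamped=" + result[0] + " correct=" + result[1]);
    }
}
```

Any edit with a sequence id between the two values (here 11..15) is at risk of being skipped on replay.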

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
