[ 
https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684860#comment-13684860
 ] 

Jeffrey Zhong commented on HBASE-8701:
--------------------------------------

Thanks [~ted_yu] for the new proposal, which looks very promising because it 
allows us to keep the current HFile format and solve the same-version update 
ordering issue of this JIRA.

I'll add the remaining small changes (documented below) and a test case on top 
of your patch.

Along with storing negative original sequence numbers in the mvcc field, we 
need to extend the WAL so that the WAL keys of edits created by the replay 
command store the original log sequence number. This handles the chained RS 
failure situation.
In addition, in order to accept writes during recovery, we need to get the 
largest log sequence number from the previously failed RS. There are several 
options to address that:

1) add a large number to the max flushed sequence number of the store files of 
the failed region so that new sequence numbers won't collide with old sequence 
values. (my preferred option)
For example, adding 200 million on top of the max store file sequence id:
* it'd take 300+ years to overflow a long integer, assuming the same region 
recovers every second
* it'd take 2+ days for an RS that receives a change every millisecond, 
without a single flush, to use up the added gap

2) reject puts with explicit timestamp input during recovery
3) read through the last WAL (and possibly the trailer of the second-to-last 
WAL) to get the max sequence number. The disadvantage of this approach is that 
the recovery process is blocked until both WALs have been read. Lease recovery 
of the last WAL may also take some time because the file is most likely still 
open when the RS fails.
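
As a rough illustration of option 1, the sketch below derives the first 
sequence number a recovering region would hand out and recomputes the two 
estimates above. The class and method names are hypothetical and are not HBase 
APIs; only the 200 million gap comes from the proposal.

```java
public class SeqIdGapSketch {
    // Gap proposed in option 1; every other name here is illustrative only.
    static final long SEQ_ID_GAP = 200_000_000L;

    // First sequence id to use for new writes on the recovering region.
    static long nextSeqIdAfterRecovery(long maxStoreFileSeqId) {
        // addExact throws ArithmeticException instead of silently wrapping.
        return Math.addExact(maxStoreFileSeqId, SEQ_ID_GAP);
    }

    // Number of recoveries that fit before a long overflows.
    static long recoveriesUntilOverflow(long startSeqId) {
        return (Long.MAX_VALUE - startSeqId) / SEQ_ID_GAP;
    }

    // Days until one edit per millisecond, with no flush, uses up the gap.
    static double daysToExhaustGap() {
        double seconds = SEQ_ID_GAP / 1000.0;
        return seconds / 86_400.0;
    }

    public static void main(String[] args) {
        System.out.println(nextSeqIdAfterRecovery(1_000L));
        // Years of headroom at one recovery per second, starting from 0.
        System.out.println(recoveriesUntilOverflow(0L) / (365L * 86_400L));
        System.out.println(daysToExhaustGap());
    }
}
```

Under these assumptions the overflow headroom comes out well above the 300 
years quoted, and the gap lasts a little over two days at one edit per 
millisecond.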

 
                
> distributedLogReplay need to apply wal edits in the receiving order of those 
> edits
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8701
>                 URL: https://issues.apache.org/jira/browse/HBASE-8701
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.98.0, 0.95.2
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts 
> of the same key + version (timestamp). After replay, the value of the key is 
> nondeterministic.
> h5. The original concern raised by [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue also happens in similar situations, such as Put(key, t=T) in WAL1 
> and Put(key, t=T) in WAL2.
> h5. Below is the option I'd like to use:
> a) During replay, we pass the WAL file name hash in each replay batch and 
> the original WAL sequence id of each edit to the receiving RS
> b) Once a WAL is recovered, the replaying RS sends a signal to the receiving 
> RS so the receiving RS can flush
> c) In the receiving RS, different WAL files of a region send edits to 
> different memstores. (We can visualize this at a high level as sending changes 
> to a new region object with name (origin region name + WAL name hash) and 
> using the original sequence ids.) 
> d) writes from normal traffic (allowing writes during recovery) are put in 
> normal memstores as of today and flushed normally with new sequence ids.
> h5. The alternative options are listed below for reference:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass original wal sequence ids
> c) hold flushes until all WALs of a recovering region are replayed. The 
> memstore should be able to hold them because we only recover unflushed WAL 
> edits. For edits with the same key + version, the one with the larger 
> sequence id wins.
> Option two
> a) During replay, we pass the original WAL sequence ids
> b) for each WAL edit, we store the edit's original sequence id along with 
> its key. 
> c) during scanning, we use the original sequence id if it's present, 
> otherwise the store file sequence id
> d) compaction can just keep the put with the max sequence id
> Please let me know if you have better ideas.
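
For reference, the tie-breaking rule that the quoted options rely on (for edits 
with the same key + version the larger sequence id wins, and option two prefers 
an edit's original sequence id over its store file sequence id when present) 
can be sketched as follows. All types, names and sequence id values are 
hypothetical and are not HBase APIs; the scenario loosely follows the 
[ A, B ] log plus user put C example above.

```java
import java.util.OptionalLong;

public class ReplayOrderSketch {
    // A put as seen at scan time; originalSeqId is set only for replayed edits.
    static final class Edit {
        final String value;
        final OptionalLong originalSeqId;
        final long storeFileSeqId;

        Edit(String value, OptionalLong originalSeqId, long storeFileSeqId) {
            this.value = value;
            this.originalSeqId = originalSeqId;
            this.storeFileSeqId = storeFileSeqId;
        }

        // Original sequence id if present, otherwise the store file's id.
        long effectiveSeqId() {
            return originalSeqId.orElse(storeFileSeqId);
        }
    }

    // Picks the winner between two edits with the same key + version.
    static Edit winner(Edit a, Edit b) {
        return a.effectiveSeqId() >= b.effectiveSeqId() ? a : b;
    }

    public static void main(String[] args) {
        // B was logged before the failure (made-up original seq id 2) but is
        // flushed after C, so its store file carries the larger id (11 vs 10).
        Edit b = new Edit("B", OptionalLong.of(2L), 11L);
        Edit c = new Edit("C", OptionalLong.empty(), 10L);
        System.out.println(winner(b, c).value);
    }
}
```

With store file sequence ids alone, B would shadow the newer user write C; 
using the original sequence id for replayed edits lets C win in this sketch.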

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
