[ 
https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Zhong updated HBASE-8701:
---------------------------------

    Attachment: hbase-8701-v5.patch

[[email protected]] Thanks for the comments!

In the v5 patch, I added a new verification step after flush.

{quote}
Sometimes the mvcc number is a sequence number (a negative one!) and other 
times it is an mvcc. This hack is spread about the code base.
{quote}
I have to admit that overloading the mvcc number is a hack, but it lets us 
address the JIRA without modifying the hfile format and with minimal changes 
(you can see Ted's v3 patch is only 9kb). I hope that once cell tags are in 
place we can clean up the hack with trivial effort. 

{quote}
The 200M here is meant to span all edits out in WAL logs?
{quote}
Yes. It's just a number big enough to make sure new sequence numbers won't 
collide with old log sequence numbers without having to read the WAL files. An 
RS is extremely unlikely to accumulate 200 million changes before a region 
flush, because we flush an online region every hour by default and have other 
logic that forces a flush on the regions with the minimum sequence number when 
the number of log files reaches a certain limit.
That being said, we can leave this out until we have a consensus in HBASE-8741.
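For illustration, here is a minimal sketch of the bumper arithmetic described above. The class, constant, and method names are hypothetical (not from the patch); only the 200M margin comes from this discussion:

```java
// Hypothetical sketch of the "safety bumper" idea: when a recovering region
// opens, start issuing new sequence numbers well past any id the unreplayed
// WALs could contain, so new edits never collide with replayed ones.
public class SequenceBumperSketch {
    // Assumed margin; the discussion above uses a 200M bumper.
    static final long SAFETY_BUMPER = 200_000_000L;

    /** First sequence id to hand out after opening a recovering region. */
    static long nextSequenceId(long lastFlushedSeqId) {
        return lastFlushedSeqId + SAFETY_BUMPER + 1;
    }

    public static void main(String[] args) {
        // A region last flushed at seq id 12,345 resumes at 200,012,346.
        System.out.println(nextSequenceId(12_345L));
    }
}
```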

{quote}
An HLogEdit doesn't have sequence number already? What is logSeqNum? What is 
relation to below?
{quote}
It's the original log sequence number from when a WAL is first replayed. We 
store it in the WALEdit so that we can persist the number into the HLogKey of 
a WAL Entry, to handle the case where the receiving RS fails again during a 
replay.

{quote}
Chatting w/ Himanshu, he wondered if it is possible that a memstore get flushed 
w/ a negative mvcc?
{quote}
It's possible, because we store the sequence number along with the KV.
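To make the negative-mvcc handling above concrete, here is a hypothetical sketch (class and method names are mine, not the patch's) of one plausible way a comparator could decode the overloaded field: a negative value -s stands for original WAL sequence id s, and the cell with the larger decoded id sorts as newer:

```java
// Hypothetical sketch of decoding the overloaded mvcc field: a negative
// value encodes the negated original WAL sequence id of a replayed edit,
// while a non-negative value is an ordinary mvcc/memstore number.
public class OverloadedMvccSketch {
    /** Decode the overloaded field into a comparable ordering value. */
    static long effectiveSeq(long mvcc) {
        return mvcc < 0 ? -mvcc : mvcc;
    }

    /** Newest-first ordering on two cells' overloaded mvcc values, in the
     *  spirit of the negative-MVCC handling added to KVScannerComparator. */
    static int compareNewestFirst(long mvccA, long mvccB) {
        return Long.compare(effectiveSeq(mvccB), effectiveSeq(mvccA));
    }

    public static void main(String[] args) {
        // A replayed edit with original seq id 7 beats one with seq id 3,
        // so it sorts first (negative comparison result).
        System.out.println(compareNewestFirst(-7L, -3L) < 0);
    }
}
```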
                
> distributedLogReplay need to apply wal edits in the receiving order of those 
> edits
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8701
>                 URL: https://issues.apache.org/jira/browse/HBASE-8701
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: 8701-v3.txt, hbase-8701-v4.patch, hbase-8701-v5.patch
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts 
> of the same key + version (timestamp). After replay, the value of the key is 
> nondeterministic.
> h5. The original concerning scenario raised by [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will happen in similar situations, e.g. Put(key, t=T) in WAL1 and 
> Put(key, t=T) in WAL2
> h5. Below is the option (proposed by Ted) I'd like to use:
> a) During replay, we pass the original wal sequence number of each edit to 
> the receiving RS
> b) In the receiving RS, we store the negated original sequence number of 
> each wal edit into the mvcc field of its KVs
> c) Add handling of negative MVCC in KVScannerComparator and KVComparator   
> d) In the receiving RS, write the original sequence number into an optional 
> field of the wal file to handle the chained RS failure situation 
> e) When opening a region, we add a safety bumper (a large number) so that 
> the new sequence numbers of a newly opened region don't collide with old 
> sequence numbers. 
> In the future, when we store the sequence number along with KVs, we can 
> adjust the above solution a bit to avoid overloading the MVCC field.
> h5. The other alternative options are listed below for reference:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass the original wal sequence ids
> c) hold flushes till all wals of a recovering region are replayed. The 
> memstore should hold, because we only recover unflushed wal edits. For edits 
> with the same key + version, the one with the larger sequence id wins.
> Option two
> a) During replay, we pass the original wal sequence ids
> b) for each wal edit, we store the edit's original sequence id along with 
> its key. 
> c) during scanning, we use the original sequence id if it's present, 
> otherwise the store file sequence id
> d) compaction can just keep the put with the max sequence id
> Please let me know if you have better ideas.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
