[
https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeffrey Zhong updated HBASE-8701:
---------------------------------
Description:
This issue happens in distributedLogReplay mode when recovering multiple puts
of the same key + version (timestamp). After replay, the final value of the
key is nondeterministic.
h5. The original concern raised by [~eclark]:
For all edits the rowkey is the same.
There's a log with: [ A (ts = 0), B (ts = 0) ]
Replay the first half of the log.
A user puts in C (ts = 0)
Memstore has to flush
A new HFile will be created with [ C, A ] and MaxSequenceId = C's seqid.
Replay the rest of the log.
Flush
The same issue arises in any similar situation, e.g. Put(key, t=T) in WAL1 and
Put(key, t=T) in WAL2. A minimal sketch of the nondeterminism follows below.
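To make the nondeterminism concrete, here is a minimal, self-contained sketch
(plain Java, no HBase dependencies; SimCell, resolve, and the seqid values are
hypothetical illustrations, not HBase internals). It models the rule that,
among cells with the same key + timestamp, the cell with the highest sequence
id wins, and shows the winner flipping with the replay interleaving:
{code:java}
// Minimal sketch of the replay-order nondeterminism described above.
// SimCell/resolve are hypothetical names; the seqids are illustrative only.
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReplayOrderSketch {
  // One rowkey, identical timestamp: only the value and seqid differ.
  record SimCell(String value, long seqId) {}

  // Resolution rule: for the same key + timestamp, the highest seqid wins.
  static SimCell resolve(List<SimCell> cells) {
    return cells.stream()
        .max(Comparator.comparingLong(SimCell::seqId)).orElseThrow();
  }

  public static void main(String[] args) {
    // Replay assigns fresh seqids in arrival order, so B (replayed after the
    // flush) gets a seqid above C's and wrongly shadows the newer user put C.
    List<SimCell> interleaved = Arrays.asList(
        new SimCell("A", 10),   // first half of the log replayed
        new SimCell("C", 11),   // concurrent user put, flushed as HFile [C, A]
        new SimCell("B", 12));  // rest of the log replayed after the flush
    System.out.println(resolve(interleaved).value()); // B

    // If replay finishes before the user put, C correctly wins.
    List<SimCell> sequential = Arrays.asList(
        new SimCell("A", 10), new SimCell("B", 11), new SimCell("C", 12));
    System.out.println(resolve(sequential).value()); // C
  }
}
{code}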
h5. Below is the option (proposed by Ted) I'd like to use:
a) During replay, we pass the original wal sequence number of each edit to the
receiving RS
b) In the receiving RS, we store the negated original sequence number of each
wal edit in the mvcc field of its KVs
c) Add handling of negative MVCC values in KVScannerComparator and KVComparator
d) In the receiving RS, write the original sequence number into an optional
field of the wal file, to handle chained RS failure situations
e) When opening a region, we add a safety bumper (a large number) so that the
new sequence numbers of a newly opened region won't collide with old sequence
numbers
In the future, when we store sequence numbers along with KVs, we can adjust the
above solution slightly so that we no longer overload the MVCC field. A sketch
of steps (b) and (c) follows.
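A hedged sketch of steps (b) and (c), assuming the comparator simply decodes a
negative mvcc back to the original sequence number before comparing
(decodeMvcc/compareMvcc are hypothetical names, not the actual KVComparator
changes):
{code:java}
// Hedged sketch of (b)/(c): a negative mvcc marks a replayed edit, and its
// magnitude is the edit's original wal sequence number. Names are hypothetical.
public class NegativeMvccSketch {
  static long decodeMvcc(long mvcc) {
    return mvcc < 0 ? -mvcc : mvcc;   // recover the original sequence number
  }

  // Among cells with identical key + timestamp, the higher effective sequence
  // sorts first (descending). The safety bumper in step (e) keeps fresh,
  // positive mvcc values above any recovered original sequence number.
  static int compareMvcc(long left, long right) {
    return Long.compare(decodeMvcc(right), decodeMvcc(left));
  }

  public static void main(String[] args) {
    // A fresh write at mvcc 1000 sorts before a replayed edit with original
    // seq 5 (stored as -5), so the fresh write wins.
    System.out.println(compareMvcc(1000L, -5L) < 0);  // true
    // Between two replayed edits, the larger original seq sorts first.
    System.out.println(compareMvcc(-7L, -5L) < 0);    // true: seq 7 beats seq 5
  }
}
{code}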
h5. The other alternative options are listed below for reference:
Option one
a) disallow writes during recovery
b) during replay, we pass the original wal sequence ids
c) hold flushes till all wals of a recovering region are replayed. The memstore
can hold the edits because we only recover unflushed wal edits. For edits with
the same key + version, whichever has the larger sequence id wins (see the
sketch after this list).
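A minimal sketch of the flush hold in (c), assuming a per-region counter of
wals still being replayed that is consulted before flushing
(recoveringWalCount/requestFlush are hypothetical names, not the real HRegion
internals):
{code:java}
// Hedged sketch of Option one (c): gate flushes on outstanding recovering wals.
import java.util.concurrent.atomic.AtomicInteger;

public class FlushGateSketch {
  // Number of this region's wals still being replayed; 0 means recovery done.
  private final AtomicInteger recoveringWalCount = new AtomicInteger(2);

  boolean requestFlush() {
    if (recoveringWalCount.get() > 0) {
      return false;            // hold the flush until every wal is replayed
    }
    // ... flush the memstore here ...
    return true;
  }

  void walReplayFinished() {
    recoveringWalCount.decrementAndGet();  // may unblock a pending flush
  }
}
{code}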
Option two
a) During replay, we pass the original wal sequence ids
b) for each wal edit, we store the edit's original sequence id along with its
key
c) during scanning, we use the original sequence id if it's present, otherwise
the store file sequence id
d) compaction can then just keep the put with the max sequence id (see the
sketch after this list)
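A hedged sketch of (c) and (d), assuming the original sequence id travels as an
optional field next to the key (Edit, effectiveSeqId, and keepForCompaction are
hypothetical names, not HBase API):
{code:java}
// Hedged sketch of Option two (c)/(d). Names are illustrative only.
import java.util.Comparator;
import java.util.List;
import java.util.OptionalLong;

public class OriginalSeqIdSketch {
  record Edit(String value, OptionalLong originalSeqId, long storeFileSeqId) {
    // (c) prefer the edit's original sequence id, else the store file's.
    long effectiveSeqId() {
      return originalSeqId.orElse(storeFileSeqId);
    }
  }

  // (d) compaction keeps only the put with the max effective sequence id.
  static Edit keepForCompaction(List<Edit> sameKeyVersion) {
    return sameKeyVersion.stream()
        .max(Comparator.comparingLong(Edit::effectiveSeqId)).orElseThrow();
  }
}
{code}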
Please let me know if you have better ideas.
was:
This issue happens in distributedLogReplay mode when recovering multiple puts
of the same key + version (timestamp). After replay, the final value of the
key is nondeterministic.
h5. The original concern raised by [~eclark]:
For all edits the rowkey is the same.
There's a log with: [ A (ts = 0), B (ts = 0) ]
Replay the first half of the log.
A user puts in C (ts = 0)
Memstore has to flush
A new HFile will be created with [ C, A ] and MaxSequenceId = C's seqid.
Replay the rest of the log.
Flush
The same issue arises in any similar situation, e.g. Put(key, t=T) in WAL1 and
Put(key, t=T) in WAL2.
h5. Below is the option I'd like to use:
a) During replay, we pass a wal file name hash with each replay batch, plus the
original wal sequence id of each edit, to the receiving RS
b) Once a wal is recovered, the replaying RS sends a signal to the receiving RS
so the receiving RS can flush
c) In the receiving RS, different wal files of a region send their edits to
different memstores. (At a high level, we can visualize this as sending changes
to a new region object named (origin region name + wal name hash) and using the
original sequence ids.)
d) writes from normal traffic (writes are allowed during recovery) go into the
normal memstores as they do today and flush normally with new sequence ids
A sketch of the per-wal routing in (c) follows.
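A hedged sketch of the routing in (c), keying recovery memstores by wal name
hash (the map and all names here are hypothetical, not the actual HRegion
structures):
{code:java}
// Hedged sketch of the old proposal's step (c): route replayed edits to a
// per-wal recovery memstore so each wal can flush independently with its
// original sequence ids, while normal traffic uses the regular memstore.
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class PerWalMemstoreSketch {
  // One sorted in-memory store per recovering wal, keyed by wal name hash.
  private final Map<Integer, NavigableMap<String, String>> recoveryStores =
      new ConcurrentHashMap<>();
  // Normal traffic keeps using the regular memstore with new sequence ids.
  private final NavigableMap<String, String> normalMemstore =
      new ConcurrentSkipListMap<>();

  void replayEdit(int walNameHash, String key, String value) {
    recoveryStores
        .computeIfAbsent(walNameHash, h -> new ConcurrentSkipListMap<>())
        .put(key, value);
  }

  void normalPut(String key, String value) {
    normalMemstore.put(key, value);
  }

  // (b) on the "wal recovered" signal, flush just that wal's memstore.
  NavigableMap<String, String> flushRecoveredWal(int walNameHash) {
    return recoveryStores.remove(walNameHash);
  }
}
{code}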
h5. The other alternative options are listed below for reference:
Option one
a) disallow writes during recovery
b) during replay, we pass the original wal sequence ids
c) hold flushes till all wals of a recovering region are replayed. The memstore
can hold the edits because we only recover unflushed wal edits. For edits with
the same key + version, whichever has the larger sequence id wins.
Option two
a) During replay, we pass the original wal sequence ids
b) for each wal edit, we store the edit's original sequence id along with its
key
c) during scanning, we use the original sequence id if it's present, otherwise
the store file sequence id
d) compaction can then just keep the put with the max sequence id
Please let me know if you have better ideas.
> distributedLogReplay needs to apply wal edits in the receiving order of those
> edits
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8701
> URL: https://issues.apache.org/jira/browse/HBASE-8701
> Project: HBase
> Issue Type: Bug
> Components: MTTR
> Reporter: Jeffrey Zhong
> Assignee: Jeffrey Zhong
> Fix For: 0.98.0, 0.95.2
>
> Attachments: 8701-v3.txt
>
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira