[
https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687049#comment-13687049
]
Himanshu Vashishtha commented on HBASE-8701:
--------------------------------------------
I have a follow-up question about the negative mvcc values.
The current flusher resets the mvcc point to 0 if it is older than the minimum
read point across all the scanners. There are two cases here:
a) The region is in recovery mode: the minimum read point will be maxSequenceIds
(obtained from the StoreFiles) + 1, as the region is not yet available for
reads. So the flusher will reset this negative sequenceId to 0? Does that mean
hfiles flushed during recovery will not contain negative numbers?
Let's say we change this and keep the negative numbers intact.
b) Recovery is complete and the region is available for reads. There might be
some scanners open, and we would now have a legitimate minimum read point.
i) I see that, as an optimization, we set memstoreTS to 0 when it is <
MVCC.readpoint (even during a simple scan).
Do we need to remove that optimization now?
ii) How do we handle Deletes now? I see SQM (ScanQueryMatcher) comparing
memstoreTS with mvccReadPoint in some places (see the match method), especially
in settings where it wants to seePastDeleteMarkers, or with the CF-level
attribute for retaining Delete markers.
[~lhofhansl], what do you think about this delete handling with negative mvcc
values?
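To make cases (a) and (b.i) concrete, here is a minimal sketch of the reset
rule as I understand it. The class and method names are hypothetical and this
is not the actual HBase flusher code; "older than" is modeled as a plain `<`
comparison, and the second method is only one possible way of keeping negative
values intact.

```java
final class FlushResetSketch {
    // Rule as described above: an mvcc value older than the smallest
    // read point is rewritten to 0 before it is written out.
    static long memstoreTsForFlush(long memstoreTs, long smallestReadPoint) {
        return memstoreTs < smallestReadPoint ? 0L : memstoreTs;
    }

    // Hypothetical variant (an assumption, not existing behavior):
    // negative values carry a replayed edit's original sequence number,
    // so they are passed through untouched.
    static long memstoreTsForFlushKeepingNegative(long memstoreTs, long smallestReadPoint) {
        if (memstoreTs < 0) {
            return memstoreTs;
        }
        return memstoreTs < smallestReadPoint ? 0L : memstoreTs;
    }
}
```

With a recovery read point of maxSequenceId + 1 (say 101), a replayed edit
stored as -42 is below the read point, so the first method turns it into 0 and
the original sequence number is lost; the second method keeps -42.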
Please let me know what you think of the above concerns. In case I missed
something, please correct me.
Thanks.
> distributedLogReplay need to apply wal edits in the receiving order of those
> edits
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8701
> URL: https://issues.apache.org/jira/browse/HBASE-8701
> Project: HBase
> Issue Type: Bug
> Components: MTTR
> Reporter: Jeffrey Zhong
> Assignee: Jeffrey Zhong
> Fix For: 0.98.0, 0.95.2
>
> Attachments: 8701-v3.txt, hbase-8701-v4.patch, hbase-8701-v5.patch,
> hbase-8701-v6.patch
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts
> of the same key + version (timestamp). After replay, the value of the key is
> nondeterministic.
> h5. The original scenario raised by [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The same issue occurs in similar situations, such as Put(key, t=T) in WAL1 and
> Put(key, t=T) in WAL2.
> h5. Below is the option (proposed by Ted) I'd like to use:
> a) During replay, we pass original wal sequence number of each edit to the
> receiving RS
> b) In receiving RS, we store negative original sequence number of wal edits
> into mvcc field of KVs of wal edits
> c) Add handling of negative MVCC in KVScannerComparator and KVComparator
> d) In receiving RS, write original sequence number into an optional field of
> wal file for chained RS failure situation
> e) When opening a region, we add a safety bumper (a large number) so that the
> new sequence numbers of a newly opened region do not collide with old
> sequence numbers.
> In the future, when we store sequence numbers along with KVs, we can adjust
> the above solution slightly to avoid overloading the MVCC field.
> h5. The alternative options are listed below for reference:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass original wal sequence ids
> c) hold flushes until all wals of a recovering region are replayed. Memstore
> should hold because we only recover unflushed wal edits. For edits with the
> same key + version, whichever has the larger sequence id wins.
> Option two
> a) During replay, we pass original wal sequence ids
> b) for each wal edit, we store each edit's original sequence id along with
> its key.
> c) during scanning, we use the original sequence id if it's present;
> otherwise we use its store file sequence id
> d) compaction can keep just the put with the max sequence id
> Please let me know if you have better ideas.
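
Step (c) of the proposed option in the description above, handling negative
MVCC in the comparators, could be sketched roughly as follows. This is an
illustration under stated assumptions, not the logic of the attached patches:
`Edit` is a hypothetical stand-in for a KV whose row, column, and timestamp are
all equal, a replayed edit carries the negated original WAL sequence number in
its mvcc field, and any live edit (mvcc >= 0) is assumed to be newer than every
replayed edit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReplayOrderSketch {
    // Hypothetical stand-in for a KV; key and timestamp are assumed equal.
    static final class Edit {
        final String value;
        final long mvcc; // negative: -(original WAL sequence number)

        Edit(String value, long mvcc) {
            this.value = value;
            this.mvcc = mvcc;
        }
    }

    // Orders edits newest first.
    static final Comparator<Edit> NEWEST_FIRST = (x, y) -> {
        boolean xReplayed = x.mvcc < 0;
        boolean yReplayed = y.mvcc < 0;
        if (xReplayed != yReplayed) {
            // Assumption: a live edit is newer than any replayed edit.
            return xReplayed ? 1 : -1;
        }
        if (xReplayed) {
            // Both replayed: the more negative mvcc has the larger
            // original sequence number, so it sorts first.
            return Long.compare(x.mvcc, y.mvcc);
        }
        // Both live: the larger mvcc sorts first.
        return Long.compare(y.mvcc, x.mvcc);
    };

    // Returns the values of the given edits in newest-first order.
    static List<String> order(List<Edit> edits) {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(NEWEST_FIRST);
        List<String> values = new ArrayList<>();
        for (Edit e : sorted) {
            values.add(e.value);
        }
        return values;
    }
}
```

Applied to the scenario from the description, with A and B replayed as mvcc -1
and -2 and the user's put C given some positive mvcc, the order is C, B, A
regardless of the order in which the edits arrived. A plain descending
comparison on the signed mvcc value would instead place A ahead of B, which is
why the comparators need explicit handling for negative values.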