[
https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688867#comment-13688867
]
stack commented on HBASE-8701:
------------------------------
Looking at v7:
{code}
+ // If both KeyValues carry seq Id, there is no need to negate the result of comparison
+ if (left.getMemstoreTS() < 0 && right.getMemstoreTS() < 0) {
+   return Longs.compare(left.getMemstoreTS(), right.getMemstoreTS());
+ }
{code}
Needs a better comment. The value being compared is a ts but it's called a seqid
in the comment. Confusing.
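For my own notes, a minimal sketch of why comparing the stored values directly gives newest-first
ordering, assuming (per the description's option b) that replayed KVs carry the negated original WAL
sequence number in memstoreTS; the wrapper class and the numbers are mine, not the patch's:
{code}
import com.google.common.primitives.Longs;

public class NegatedSeqIdOrdering {
  public static void main(String[] args) {
    // Two replayed cells for the same key + timestamp; made-up original WAL seq ids.
    long newerEdit = -105L;  // original seq id 105, stored negated in memstoreTS
    long olderEdit = -17L;   // original seq id 17, stored negated in memstoreTS
    // -105 < -17, so the newer edit (larger original seq id) already sorts first
    // without negating the comparison result, unlike the positive-mvcc case where
    // the larger memstoreTS has to be forced to the front.
    System.out.println(Longs.compare(newerEdit, olderEdit)); // negative: newerEdit first
  }
}
{code}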
Woah... what's up here?
{code}
- decodeMemstoreTS = Bytes.toLong(fileInfo.get(HFileWriterV2.MAX_MEMSTORE_TS_KEY)) > 0;
+ byte[] needDecoding = fileInfo.get(HFileWriterV2.NEED_DECODE_MEMSTORE_TS_KEY);
+ if (needDecoding != null) {
+   decodeMemstoreTS = Bytes.toBoolean(needDecoding);
+ } else {
+   decodeMemstoreTS = Bytes.toLong(fileInfo.get(HFileWriterV2.MAX_MEMSTORE_TS_KEY)) > 0;
+ }
{code}
Sometimes it's a boolean and other times it's a ts?
What is the 'decoding' that is going on here?
Regarding the 200M:
+ RS A crashes. It was carrying 15 WALs. The last WAL was unfinished.
+ RS B gets a region X from RS A. We open it for writes while we are recovering this region. We add 200M to its sequenceid because this region has seqids in excess of what the RS is currently carrying. We take in 1 edit while recovering. We do not flush. We crash.
+ RS C recovers X. It adds 200M to its seqid. We take in 1 edit while recovering X.
What guarantees are there that the recovery done on RS C has seqids in excess
of those of RS B?
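To make that question concrete, a back-of-the-envelope sketch with made-up numbers; the variable
names and, especially, the base the 200M is added to are my assumptions, which is exactly the point
in question:
{code}
public class SeqIdBumpQuestion {
  public static void main(String[] args) {
    final long BUMP = 200_000_000L;     // the "safety bumper" added on region open
    long lastFlushedSeqIdOfX = 1_000L;  // made-up: highest persisted seq id of region X from RS A

    // RS B opens X for writes during recovery and bumps:
    long rsBOpenSeqId = lastFlushedSeqIdOfX + BUMP;
    long rsBEditSeqId = rsBOpenSeqId + 1;  // the one edit taken in; never flushed

    // RS B crashes. RS C recovers X and bumps again. If the bump is applied to the
    // same persisted base (nothing of RS B's was flushed), RS C starts where RS B did:
    long rsCOpenSeqId = lastFlushedSeqIdOfX + BUMP;
    long rsCEditSeqId = rsCOpenSeqId + 1;

    // Under these assumptions RS C's edit does not exceed RS B's, which is the question.
    System.out.println(rsCEditSeqId > rsBEditSeqId);  // false
  }
}
{code}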
It seems wrong that a region would add itself to the list of recovering regions
the HRegionServer is hosting. Doesn't the HRegionServer have more context?
(And this is polluting HRegion w/ HRegionServer specifics.) Who judges when the
region can start accepting reads? The HRegionServer? If so, it should be
managing whether a region is in the recovering state, not the HRegion itself.
Presumption here is that edits are sorted:
{code}
+ Mutation mutation = batchOp.operations[i].getFirst();
{code}
Is that a safe presumption to make in replay?
Is this the least sequenceid of the batch?
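As a trivial illustration of why the presumption matters (made-up values, nothing from the patch):
{code}
import java.util.Arrays;

public class FirstIsNotMin {
  public static void main(String[] args) {
    // Hypothetical original seq ids of a replayed batch, in arrival order.
    long[] seqIdsInBatch = { 42, 7, 99 };
    long first = seqIdsInBatch[0];                              // 42
    long min = Arrays.stream(seqIdsInBatch).min().getAsLong();  // 7
    // Taking operations[0] only yields the least seq id of the batch
    // if the batch has been sorted by seq id before replay.
    System.out.println(first == min);  // false unless sorted
  }
}
{code}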
No comment on why all of a sudden we decide to negate the sequence number:
{code}
- kv.setMemstoreTS(localizedWriteEntry.getWriteNumber());
+ kv.setMemstoreTS(seqId == NO_SEQ_ID ? localizedWriteEntry.getWriteNumber() : -seqId);
{code}
Again, what is the difference between these two sequenceids?
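My reading of the difference, going by the description's option (b); the wrapper below is mine and
the NO_SEQ_ID value is a stand-in, only the idea comes from the patch:
{code}
public class MemstoreTsTagging {
  static final long NO_SEQ_ID = -1;  // stand-in sentinel, not necessarily the patch's value

  // Live writes tag the cell with the positive mvcc write number; replayed edits
  // tag it with the negated original WAL seq id so the comparators can tell them apart.
  static long memstoreTsFor(long mvccWriteNumber, long replaySeqId) {
    return (replaySeqId == NO_SEQ_ID) ? mvccWriteNumber : -replaySeqId;
  }

  public static void main(String[] args) {
    System.out.println(memstoreTsFor(12, NO_SEQ_ID));  // live write path: 12
    System.out.println(memstoreTsFor(12, 105));        // replay path: -105
  }
}
{code}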
{code}
private long logSeqNum;
+ // used in distributedLogReplay to store original log sequence number of an edit
+ private long origLogSeqNum;
{code}
What is an original log seq num?
What is going on here?
{code}
+ long sequenceNum = (logKey.getOrigSequenceNumber() > 0) ? logKey.getOrigSequenceNumber()
+     : logKey.getLogSeqNum();
{code}
The 'orig' seq number is > 0, I take it? Is this 'if it is present'?
We only do this stuff for Puts and Deletes? Don't we have other types out in
the WAL?
The HLogKey gets carried into WALEdit? Do we have it in two places, or is it just
when we instantiate the WALEdit replaying edits? Do we have to add it to
WALEdit at all?
We seem to be polluting types to carry info down into the depths of an HRegion.
> distributedLogReplay need to apply wal edits in the receiving order of those
> edits
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8701
> URL: https://issues.apache.org/jira/browse/HBASE-8701
> Project: HBase
> Issue Type: Bug
> Components: MTTR
> Reporter: Jeffrey Zhong
> Assignee: Jeffrey Zhong
> Fix For: 0.98.0, 0.95.2
>
> Attachments: 8701-v3.txt, hbase-8701-v4.patch, hbase-8701-v5.patch,
> hbase-8701-v6.patch, hbase-8701-v7.patch
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts
> of the same key + version (timestamp). After replay, the value of the key is
> nondeterministic.
> h5. The original concern raised by [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will happen in similar situations, e.g. Put(key, t=T) in WAL1 and
> Put(key, t=T) in WAL2
> h5. Below is the option (proposed by Ted) I'd like to use:
> a) During replay, we pass original wal sequence number of each edit to the
> receiving RS
> b) In receiving RS, we store the negative original sequence number of wal
> edits into the mvcc field of their KVs
> c) Add handling of negative MVCC in KVScannerComparator and KVComparator
> d) In receiving RS, write the original sequence number into an optional field
> of the wal file for the chained RS failure situation
> e) When opening a region, we add a safety bumper (a large number) in order for
> the new sequence number of a newly opened region not to collide with old
> sequence numbers.
> In the future, when we store sequence numbers along with KVs, we can adjust
> the above solution a little bit to avoid overloading the MVCC field.
> h5. The other alternative options are listed below for reference:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass original wal sequence ids
> c) hold flush till all wals of a recovering region are replayed. Memstore
> should hold because we only recover unflushed wal edits. For edits with the
> same key + version, whichever has the larger sequence Id wins.
> Option two
> a) During replay, we pass original wal sequence ids
> b) for each wal edit, we store each edit's original sequence id along with
> its key.
> c) during scanning, we use the original sequence id if it's present, otherwise
> the store file sequence Id
> d) compaction can just keep the put with the max sequence id
> Please let me know if you have better ideas.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira