[ 
https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691616#comment-13691616
 ] 

stack commented on HBASE-8701:
------------------------------

Pardon me.  I see that the 'original seqnum' IS serialized, written into WALKey.

In KV, why there are would be -ve mvcc needs to be explained.

MINOR_VERSION_WITH_MVCC_SEQ_ID_UNION is defined but not used?

What is 'decoding'?  That it is -ve?

{code}
+  /** whether memstore (mvcc) timestamp field needs decoding in reader */
{code}

What is this?  Should it be test for -ve?

{code}
+    if (kv.getMemstoreTS() != 0) {
+      this.needDecodeMemstoreTS = true;
+    }
{code}

Seems like any non-null mvcc needs 'decoding'.  If so, do we need this flag 
stuff going on?  (Sorry for the dumb questions... just trying to figure whats 
going on here).

Could you just do this in reader:

decodeMemstoreTS = 
Bytes.toLong(fileInfo.get(HFileWriterV2.MAX_MEMSTORE_TS_KEY)) != 0;

... and have to mess in Writer?

I have already remarked on  how it is odd that HRegion adds itself to the 
recovering region list -- that the regionserver should be doing this, not the 
hregion (it used to be done in OpenRegionHandler?)

If the log replay edits skip WAL, we don't have to carry this 'original seqid'? 
 But doing it this way, we can archive WALs as we finish their replay and if 
the RS we are playing into dies, we can then play any left over WALs... and any 
WALs it may have accummulated in meantime.


On this:

{code}
-          this.comparator.compare(kvNext, topScanner.peek()) >= 0) {
+          this.comparator.compare(kvNext, this.current.getSequenceID(),
+            topScanner.peek(), topScanner.getSequenceID()) >= 0) {
{code}

Do you think above a bug fix?  It looks like hfile is seqid of 0 and memstore 
is seqid of MAX_VALUE.  Are they ever different?  Does the 'orig seqid' ever 
get in the mix here?

Is this method used in a few places?

+    public int compare(KeyValue leftKV, long leftSeqId, KeyValue rightKV, long 
rightSeqId) {

Previous we'd get the seqid internal from the left and right key but not we 
allow the seqid being passed independent of the kvs being compared.  Is that 
what we want?  Why open up this compare?  For the scan compares on the heap?

This is pretty fundamental change.  Compares mvcc first, then seqid.

And if the mvcc is < 0 in the below, what happens?

{code}
-          if (kv.getMemstoreTS() <= smallestReadPoint) {
+          if (kv.getMemstoreTS() >= 0 && kv.getMemstoreTS() <= 
smallestReadPoint) {
{code}

ditto here:

{code}
-        if (kv.getMemstoreTS() <= smallestReadPoint) {
+        if (kv.getMemstoreTS() > 0 && kv.getMemstoreTS() <= smallestReadPoint) 
{
{code}


In general, little explanation in the patch on why things are as they are, the 
HRegion SEQNUM_SAFETY_BUMPER seqid counting needs to be done still (yeah, that 
is different issue), but high-level, this adds a load of complexity; too hard 
to follow what is going on (I am still not done w/ review; next up would be 
trying different failure scenarios).

bq. The sequence ids of hfile are intact as before.

This means that the seqid we flush with is never -ve?


                
> distributedLogReplay need to apply wal edits in the receiving order of those 
> edits
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8701
>                 URL: https://issues.apache.org/jira/browse/HBASE-8701
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: 8701-v3.txt, hbase-8701-v4.patch, hbase-8701-v5.patch, 
> hbase-8701-v6.patch, hbase-8701-v7.patch
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts 
> of the same key + version(timestamp). After replay, the value is 
> nondeterministic of the key
> h5. The original concern situation raised from [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will happen in similar situation like Put(key, t=T) in WAL1 and 
> Put(key,t=T) in WAL2
> h5. Below is the option(proposed by Ted) I'd like to use:
> a) During replay, we pass original wal sequence number of each edit to the 
> receiving RS
> b) In receiving RS, we store negative original sequence number of wal edits 
> into mvcc field of KVs of wal edits
> c) Add handling of negative MVCC in KVScannerComparator and KVComparator   
> d) In receiving RS, write original sequence number into an optional field of 
> wal file for chained RS failure situation 
> e) When opening a region, we add a safety bumper(a large number) in order for 
> the new sequence number of a newly opened region not to collide with old 
> sequence numbers. 
> In the future, when we stores sequence number along with KVs, we can adjust 
> the above solution a little bit by avoiding to overload MVCC field.
> h5. The other alternative options are listed below for references:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass original wal sequence ids
> c) hold flush till all wals of a recovering region are replayed. Memstore 
> should hold because we only recover unflushed wal edits. For edits with same 
> key + version, whichever with larger sequence Id wins.
> Option two
> a) During replay, we pass original wal sequence ids
> b) for each wal edit, we store each edit's original sequence id along with 
> its key. 
> c) during scanning, we use the original sequence id if it's present otherwise 
> its store file sequence Id
> d) compaction can just leave put with max sequence id
> Please let me know if you have better ideas.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to