[jira] [Commented] (HBASE-11591) Scanner fails to retrieve KV from bulk loaded file with highest sequence id than the cell's mvcc in a non-bulk loaded file

ramkrishna.s.vasudevan (JIRA) Mon, 18 Aug 2014 00:27:07 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100384#comment-14100384
 ]


ramkrishna.s.vasudevan commented on HBASE-11591:
------------------------------------------------

I got a clean QA run.
bq.isBulkLoadResult -> isBulkLoaded()? For setter also?
Okie. Fine with that.
bq.I see this isBulkLoadResult() in StoreFile.java level also. It would have 
been better to know this status from StoreFile rather than from StoreFileReader.
I spent some time trying that and later decided on this approach.  First, only 
the reader is passed to the StoreFileScanner, and the StoreFileScanner only has 
a reader associated with it.  So to get this information from the StoreFile I 
would need to change the StoreFileScanner constructor or add a setter, and I 
thought that would make the patch heavier.  Also, in this case the bulk load 
status has to come from the reader (because the reader reads the file info) 
and then be set on the StoreFile.  Currently the reader is also an inner class 
of StoreFile.  Considering all this, I kept the new getter/setter at the 
Reader level. 
bq.compareWithoutMvcc
Okie.  
bq.IMHO we should not do this KeyValueUtil.ensureKeyValue() stuff from now
Yes, but I think we should do that in a separate JIRA, in fact to avoid doing 
this setSeqId through KeyValueUtil.ensureKeyValue().
bq.I think we need to set KV seqId for KVs, from bulk loaded file, to the file 
seqId
Yes. I set the other KV's sequence id because I wanted to ensure that we return 
one of the two KVs that are contesting here, and that the KV returned is the 
one that would have been returned if there were no clash and the latest one 
was from the flushed file.  
Anyway, before changing this let me check some more cases and then update the 
patch accordingly.  In fact, I had initially used the sequence id of the file 
and later changed it to this approach.
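
To make the alternative discussed above concrete, here is a minimal sketch of stamping the file's sequence id onto cells read from a bulk loaded file. It is not the patch: ToyCell, ToyReader and ToyScanner are hypothetical stand-ins for the HBase classes, and the sequence ids (10, 11) and mvcc (7) are made-up values. It only assumes that the scanner is handed nothing but the reader, and that for equal keys the higher sequence id sorts first.

```java
final class BulkLoadSeqIdSketch {
    // Hypothetical cell: seqId plays the role of the cell's mvcc/sequence id.
    static final class ToyCell {
        final String key;
        final String value;
        long seqId;
        ToyCell(String key, String value, long seqId) {
            this.key = key;
            this.value = value;
            this.seqId = seqId;
        }
    }

    // Hypothetical reader: it read the file info, so it knows the file's
    // sequence id and whether the file came from a bulk load.
    static final class ToyReader {
        final long fileSeqId;
        final boolean bulkLoaded;
        final ToyCell cell;
        ToyReader(long fileSeqId, boolean bulkLoaded, ToyCell cell) {
            this.fileSeqId = fileSeqId;
            this.bulkLoaded = bulkLoaded;
            this.cell = cell;
        }
        boolean isBulkLoaded() { return bulkLoaded; }
    }

    // Hypothetical scanner: it is constructed with the reader only.
    static final class ToyScanner {
        private final ToyReader reader;
        ToyScanner(ToyReader reader) { this.reader = reader; }
        ToyCell peek() {
            ToyCell cell = reader.cell;
            if (reader.isBulkLoaded()) {
                // Cells of a bulk loaded file take the file's sequence id.
                cell.seqId = reader.fileSeqId;
            }
            return cell;
        }
    }

    // Assumed ordering: key first, then higher sequence id first.
    static int compare(ToyCell left, ToyCell right) {
        int c = left.key.compareTo(right.key);
        return c != 0 ? c : Long.compare(right.seqId, left.seqId);
    }

    static String winner() {
        ToyScanner flushed = new ToyScanner(
            new ToyReader(10L, false, new ToyCell("row1/cf:cq/ts1", "value1", 7L)));
        ToyScanner bulk = new ToyScanner(
            new ToyReader(11L, true, new ToyCell("row1/cf:cq/ts1", "value2", 0L)));
        ToyCell a = flushed.peek();
        ToyCell b = bulk.peek();
        return compare(a, b) <= 0 ? a.value : b.value;
    }

    public static void main(String[] args) {
        System.out.println(winner());
    }
}
```

In this toy model the bulk loaded cell ends up with sequence id 11, which is higher than the flushed cell's 7, so the bulk loaded value is the one returned.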

> Scanner fails to retrieve KV  from bulk loaded file with highest sequence id 
> than the cell's mvcc in a non-bulk loaded file
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-11591
>                 URL: https://issues.apache.org/jira/browse/HBASE-11591
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.99.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.99.0
>
>         Attachments: HBASE-11591.patch, HBASE-11591_1.patch, 
> HBASE-11591_2.patch, TestBulkload.java
>
>
> See discussion in HBASE-11339.
> Consider a case where the same KVs exist in two files, one produced by 
> flush/compaction and the other through bulk load.
> Both files have some identical KVs that match even in timestamp.
> Steps:
> Add some rows with a specific timestamp and flush them.  
> Bulk load a file with the same data. Ensure that the "assign seqnum" property 
> is set.
> The bulk load should use HFileOutputFormat2 (or ensure that we write the 
> bulk_time_output key).
> This would ensure that the bulk loaded file has the highest seq num.
> Assume the cell in the flushed/compacted store file is 
> row1,cf,cq,ts1, value1  and the cell in the bulk loaded file is
> row1,cf,cq,ts1,value2 
> (There are no parallel scans).
> Issue a scan on the table in 0.96. The retrieved value is 
> row1,cf,cq,ts1,value2
> But the same scan in 0.98 will retrieve row1,cf,cq,ts1,value1. 
> This is a behaviour change, caused by this code: 
> {code}
>     public int compare(KeyValueScanner left, KeyValueScanner right) {
>       int comparison = compare(left.peek(), right.peek());
>       if (comparison != 0) {
>         return comparison;
>       } else {
>         // Since both the keys are exactly the same, we break the tie in favor
>         // of the key which came latest.
>         long leftSequenceID = left.getSequenceID();
>         long rightSequenceID = right.getSequenceID();
>         if (leftSequenceID > rightSequenceID) {
>           return -1;
>         } else if (leftSequenceID < rightSequenceID) {
>           return 1;
>         } else {
>           return 0;
>         }
>       }
>     }
> {code}
> In the 0.96 case the mvcc of the cell in both files is 0, so the comparison 
> falls through to the else branch, where the seq id of the bulk loaded file is 
> greater and sorts first, ensuring that the scan reads from the bulk loaded 
> file.
> In 0.98+, as we retain the mvcc+seqid, the mvcc is not reset to 0 (it remains 
> a non-zero positive value).  Hence the compare() sorts the cell in the 
> flushed/compacted file first, which means that although we know the latest 
> file is the bulk loaded file, we don't return its data.
> Seems to be a behaviour change.  Will check on other corner cases also but we 
> are trying to know the behaviour of bulk load because we are evaluating if it 
> can be used for MOB design.
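
The difference between the two versions described above can be reproduced with a small self-contained model. This is only a sketch: ToyScanner is a hypothetical stand-in for a KeyValueScanner positioned on one cell, the sequence ids (10, 11) and the mvcc (7) are made-up values, and it assumes that the cell comparison orders equal keys by higher mvcc first before the file sequence id is consulted.

```java
import java.util.Comparator;

final class TieBreakSketch {
    // Hypothetical stand-in for a scanner positioned on a single cell.
    static final class ToyScanner {
        final String key;      // row/family:qualifier/timestamp
        final long mvcc;       // mvcc of the cell the scanner is positioned on
        final long fileSeqId;  // sequence id of the store file
        final String value;
        ToyScanner(String key, long mvcc, long fileSeqId, String value) {
            this.key = key;
            this.mvcc = mvcc;
            this.fileSeqId = fileSeqId;
            this.value = value;
        }
    }

    // Cell comparison first (key, then higher mvcc first, which is an
    // assumption of this sketch), then the file sequence id tie-break
    // from the quoted compare().
    static final Comparator<ToyScanner> COMPARATOR = (left, right) -> {
        int c = left.key.compareTo(right.key);
        if (c != 0) {
            return c;
        }
        c = Long.compare(right.mvcc, left.mvcc);
        if (c != 0) {
            return c;
        }
        return Long.compare(right.fileSeqId, left.fileSeqId);
    };

    static String firstValue(ToyScanner a, ToyScanner b) {
        return COMPARATOR.compare(a, b) <= 0 ? a.value : b.value;
    }

    // 0.96-like case: both cells carry mvcc 0, so the file seq id decides.
    static String scan096() {
        ToyScanner flushed = new ToyScanner("row1/cf:cq/ts1", 0L, 10L, "value1");
        ToyScanner bulk = new ToyScanner("row1/cf:cq/ts1", 0L, 11L, "value2");
        return firstValue(flushed, bulk);
    }

    // 0.98-like case: the flushed cell keeps a non-zero mvcc and sorts first.
    static String scan098() {
        ToyScanner flushed = new ToyScanner("row1/cf:cq/ts1", 7L, 10L, "value1");
        ToyScanner bulk = new ToyScanner("row1/cf:cq/ts1", 0L, 11L, "value2");
        return firstValue(flushed, bulk);
    }

    public static void main(String[] args) {
        System.out.println("0.96-like: " + scan096());
        System.out.println("0.98-like: " + scan098());
    }
}
```

In this model the 0.96-like scan yields value2 and the 0.98-like scan yields value1, matching the behaviour change reported in the description.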



