[
https://issues.apache.org/jira/browse/HBASE-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080547#comment-14080547
]
ramkrishna.s.vasudevan commented on HBASE-11591:
------------------------------------------------
Not sure about the other test case failures, but the newly added test case
TestScannerWithBulkLoad fails at this assertion:
{code}
protected void checkScanOrder(Cell prevKV, Cell kv,
    KeyValue.KVComparator comparator) throws IOException {
  // Check that the heap gives us KVs in an increasing order.
  assert prevKV == null || comparator == null
      || comparator.compare(prevKV, kv) <= 0 : "Key " + prevKV
      + " followed by a " + "smaller key " + kv + " in cf " + store;
}
{code}
So can we remove that assertion? This change is becoming trickier.
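
One possible reading of why the assertion trips, sketched as a toy model rather than the real HBase classes. It assumes the cell comparator falls back to mvcc (higher mvcc first) when the rest of the key is equal, and that the patched heap hands out the bulk loaded cell before the flushed cell. ToyCell, compare and inScanOrder are hypothetical names used only for illustration.

```java
// Toy model only: hypothetical stand-ins, not HBase classes.
final class ScanOrderSketch {

    /** A cell reduced to its key (row/cf/cq/ts collapsed into one string) and its mvcc. */
    static final class ToyCell {
        final String key;
        final long mvcc;

        ToyCell(String key, long mvcc) {
            this.key = key;
            this.mvcc = mvcc;
        }
    }

    /** Assumed cell ordering: key ascending, then higher mvcc first. */
    static int compare(ToyCell a, ToyCell b) {
        int c = a.key.compareTo(b.key);
        if (c != 0) {
            return c;
        }
        return Long.compare(b.mvcc, a.mvcc);
    }

    /** Same condition as the assertion in checkScanOrder, minus the comparator null check. */
    static boolean inScanOrder(ToyCell prev, ToyCell next) {
        return prev == null || compare(prev, next) <= 0;
    }

    public static void main(String[] args) {
        ToyCell flushed = new ToyCell("row1/cf/cq/ts1", 5L); // non-zero mvcc retained
        ToyCell bulk = new ToyCell("row1/cf/cq/ts1", 0L);    // assumed mvcc of 0

        System.out.println(inScanOrder(flushed, bulk)); // flushed cell first
        System.out.println(inScanOrder(bulk, flushed)); // bulk loaded cell first
    }
}
```

Under these assumptions, returning the bulk loaded cell first looks like a "smaller key" following a larger one, even though the two cells differ only in mvcc.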
> Scanner fails to retrieve KV from bulk loaded file with highest sequence id
> than the cell's mvcc in a non-bulk loaded file
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-11591
> URL: https://issues.apache.org/jira/browse/HBASE-11591
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.99.0
> Reporter: ramkrishna.s.vasudevan
> Assignee: ramkrishna.s.vasudevan
> Priority: Critical
> Fix For: 0.99.0
>
> Attachments: HBASE-11591.patch, TestBulkload.java
>
>
> See discussion in HBASE-11339.
> This happens when the same KVs exist in two files, one produced by
> flush/compaction and the other through a bulk load.
> Both files contain some identical KVs that match even in timestamp.
> Steps:
> Add some rows with a specific timestamp and flush them.
> Bulk load a file with the same data. Ensure that the "assign seqnum" property is
> set.
> The bulk load should use HFileOutputFormat2 (or ensure that we write the
> bulk_time_output key).
> This ensures that the bulk loaded file has the highest seq num.
> Assume the cell in the flushed/compacted store file is
> row1,cf,cq,ts1, value1 and the cell in the bulk loaded file is
> row1,cf,cq,ts1,value2
> (There are no parallel scans).
> Issue a scan on the table in 0.96. The retrieved value is
> row1,cf,cq,ts1,value2
> But the same scan in 0.98 will retrieve row1,cf,cq,ts1,value1.
> This is a behaviour change. It is caused by this code:
> {code}
> public int compare(KeyValueScanner left, KeyValueScanner right) {
>   int comparison = compare(left.peek(), right.peek());
>   if (comparison != 0) {
>     return comparison;
>   } else {
>     // Since both the keys are exactly the same, we break the tie in favor
>     // of the key which came latest.
>     long leftSequenceID = left.getSequenceID();
>     long rightSequenceID = right.getSequenceID();
>     if (leftSequenceID > rightSequenceID) {
>       return -1;
>     } else if (leftSequenceID < rightSequenceID) {
>       return 1;
>     } else {
>       return 0;
>     }
>   }
> }
> {code}
> In the 0.96 case the mvcc of the cell in both files is 0, so the
> comparison falls through to the else branch. There the seq id of the
> bulk loaded file is greater, so its scanner sorts first and the scan
> reads from the bulk loaded file.
> In 0.98+ we retain the mvcc+seqid, so the mvcc is not reset to 0 (it
> remains a non-zero positive value). Hence compare() sorts the cell in
> the flushed/compacted file first. This means that although we know the
> latest file is the bulk loaded file, we don't scan its data.
> This seems to be a behaviour change. We will check other corner cases as
> well, but we are trying to understand the behaviour of bulk load because we
> are evaluating whether it can be used for the MOB design.