[jira] [Created] (HBASE-16820) BulkLoad mvcc visibility only works accidentally

Enis Soztutar (JIRA) Wed, 12 Oct 2016 17:09:49 -0700

Enis Soztutar created HBASE-16820:
-------------------------------------

             Summary: BulkLoad mvcc visibility only works accidentally 
                 Key: HBASE-16820
                 URL: https://issues.apache.org/jira/browse/HBASE-16820
             Project: HBase
          Issue Type: Bug
            Reporter: Enis Soztutar



[~sergey.soldatov] has been debugging an issue with a 1.1 code base where the 
commit for HBASE-16721 broke bulk load visibility. After a bulk load, the 
bulk-loaded files are not visible because the sequence id assigned to the bulk 
load is not advanced in mvcc. 

Debugging further, we noticed that the bulk load (BL) behavior is wrong in all 
code bases, but it works "accidentally" everywhere except 1.1, where it is 
broken after HBASE-16721. Let me explain: 
 - A BL request can optionally request a flush beforehand (this should be the 
default) which causes the flush to happen with some sequenceId. The flush 
sequence id is one past all the cells' sequenceids. This flush sequence id is 
returned as a result to the flush operation. 
 - BL then uses this particular sequenceId to mark the files, but itself does 
not get a new sequenceid of its own, or advance the mvcc number. 
 - BL completes WITHOUT making sure that the sequence id is visible. 
 - BL itself, though, writes an entry to the WAL for the BL event. In 1.2 code 
bases this write goes through the whole mvcc + seqId path, which makes sure 
that earlier sequenceIds (including the flush sequenceId) are visible via mvcc. 
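
The sequence of events above can be sketched with a toy model. This is only an 
illustration of the reasoning, not HBase code: ToyRegion, readPoint, and the 
two boolean switches are hypothetical names, and the model assumes a file is 
visible exactly when its sequence id is at or below the read point. 

```java
// Toy model only: these are NOT HBase classes or APIs.
class ToyRegion {
    long nextSeqId = 1;  // next sequence id to hand out
    long readPoint = 0;  // highest sequence id readers may see (stand-in for the mvcc read point)

    // A normal write gets a sequence id and becomes visible when it completes.
    long put() {
        long seqId = nextSeqId++;
        readPoint = seqId;
        return seqId;
    }

    // The flush sequence id is one past all the cells' sequence ids.
    // Whether it also moves the read point is a switch in this model.
    long flush(boolean flushAdvancesMvcc) {
        long flushSeqId = nextSeqId++;
        if (flushAdvancesMvcc) {
            readPoint = flushSeqId;
        }
        return flushSeqId;
    }

    // The WAL entry for the BL event always gets a sequence id.
    // Only the "goes through mvcc" path also moves the read point.
    long appendBulkLoadMarker(boolean walGoesThroughMvcc) {
        long seqId = nextSeqId++;
        if (walGoesThroughMvcc) {
            readPoint = seqId;
        }
        return seqId;
    }

    boolean isVisible(long seqId) {
        return seqId <= readPoint;
    }

    // Runs the bulk load flow described above and reports whether the
    // bulk-loaded files end up visible to readers.
    static boolean bulkLoadVisible(boolean flushAdvancesMvcc, boolean walGoesThroughMvcc) {
        ToyRegion region = new ToyRegion();
        region.put();
        region.put();
        region.put();
        long flushSeqId = region.flush(flushAdvancesMvcc);
        long bulkLoadFileSeqId = flushSeqId;  // BL reuses the flush id, takes none of its own
        region.appendBulkLoadMarker(walGoesThroughMvcc);
        return region.isVisible(bulkLoadFileSeqId);
    }

    public static void main(String[] args) {
        System.out.println("WAL entry goes through mvcc: " + bulkLoadVisible(false, true));
        System.out.println("nothing advances mvcc:       " + bulkLoadVisible(false, false));
    }
}
```

In this model the bulk-loaded files become visible only as a side effect of 
something else moving the read point. When neither the flush nor the WAL entry 
touches it, the files stay invisible. 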

The problem with 1.1 is that the WAL entries only get sequence ids, but do not 
touch mvcc. With the patch for HBASE-16721, we have made it so that the 
flushedSequenceId is not used in mvcc as the highest read point (although all 
the data is still visible).

BL relying on the flush sequence id is wrong for two reasons: 
 - BL files are loaded with the flush sequence id from the memstore. This 
particular sequence id is used twice for two different things and ends up being 
the sequence id for the flushed file as well as the BL'ed files. 
 - BL should make sure that it gets a new sequence id and that sequence id is 
visible before returning the results. 
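
A minimal sketch of the behavior asked for in the second point, again as a toy 
model and not a proposed patch. ToyFixedBulkLoad and its methods are 
hypothetical names; the only assumption is that visibility means "sequence id 
at or below the read point". 

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model only: these are NOT HBase classes or APIs.
class ToyFixedBulkLoad {
    private final AtomicLong nextSeqId = new AtomicLong(1);
    private final AtomicLong readPoint = new AtomicLong(0);

    // Flush hands out a sequence id but, in this model, does not move the read point.
    long flush() {
        return nextSeqId.getAndIncrement();
    }

    // Bulk load takes a sequence id of its own and does not return until
    // that id is visible to readers.
    long bulkLoad() {
        long seqId = nextSeqId.getAndIncrement();
        // (the loaded files would be marked with seqId here)
        readPoint.accumulateAndGet(seqId, Math::max);  // never moves the read point backwards
        return seqId;
    }

    boolean isVisible(long seqId) {
        return seqId <= readPoint.get();
    }

    public static void main(String[] args) {
        ToyFixedBulkLoad region = new ToyFixedBulkLoad();
        long flushSeqId = region.flush();
        long bulkLoadSeqId = region.bulkLoad();
        System.out.println("distinct ids: " + (flushSeqId != bulkLoadSeqId));
        System.out.println("BL visible:   " + region.isVisible(bulkLoadSeqId));
    }
}
```

Here the flushed file and the BL'ed files no longer share a sequence id, and 
visibility no longer depends on what the WAL append happens to do. 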

[~ndimiduk] FYI. 
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
