[ 
https://issues.apache.org/jira/browse/HBASE-11772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122520#comment-14122520
 ] 

Anoop Sam John commented on HBASE-11772:
----------------------------------------

In your example you said that an hfile named 'abc_SeqId_10_' can exist in HBase.
I get it now. The file may already be a bulk loaded one, so the SeqId_ part is 
already in its name. When you load it into another table, a second SeqId_ part 
can be appended. So yes, to get the actual seqId of the file, go with 
lastIndexOf(). There is also no issue with using the following check to know 
whether the file is a bulk loaded one or not.
{code}
+    String fileName = this.getPath().getName();
+    int startPos = fileName.indexOf("SeqId_");
+    if (startPos != -1) {
+      bulkLoadedHFile = true;
+    }
{code}
Is this what you are thinking?

[~tedyu] was asking why the change to lastIndexOf().
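
To make the indexOf() versus lastIndexOf() distinction concrete, here is a minimal standalone sketch. It is not the patch itself. The class and method names are hypothetical, and the doubly suffixed name 'abc_SeqId_10__SeqId_25_' is an assumed example of a file that was bulk loaded twice.

```java
// Hypothetical helper, for illustration only; not taken from the HBase patch.
final class SeqIdNameDemo {
    private static final String MARKER = "SeqId_";

    // Any occurrence of the marker is enough to say "bulk loaded",
    // so indexOf() and lastIndexOf() give the same answer here.
    static boolean isBulkLoadedName(String fileName) {
        return fileName.indexOf(MARKER) != -1;
    }

    // To read the seqId assigned by the most recent load, start from the
    // LAST marker. Returns -1 when the name carries no marker at all.
    // A malformed suffix would raise NumberFormatException in this sketch.
    static long lastSeqId(String fileName) {
        int startPos = fileName.lastIndexOf(MARKER);
        if (startPos == -1) {
            return -1L;
        }
        int digitsStart = startPos + MARKER.length();
        int digitsEnd = fileName.indexOf('_', digitsStart);
        if (digitsEnd == -1) {
            digitsEnd = fileName.length();
        }
        return Long.parseLong(fileName.substring(digitsStart, digitsEnd));
    }

    public static void main(String[] args) {
        System.out.println(isBulkLoadedName("abc_SeqId_10_"));
        System.out.println(lastSeqId("abc_SeqId_10__SeqId_25_"));
    }
}
```

With the assumed name above, lastSeqId() reads 25 (the second suffix) rather than 10, while the bulk-load check is true in either case.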


> Bulk load mvcc and seqId issues with native hfiles
> --------------------------------------------------
>
>                 Key: HBASE-11772
>                 URL: https://issues.apache.org/jira/browse/HBASE-11772
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.5
>            Reporter: Jerry He
>            Assignee: Jerry He
>            Priority: Critical
>             Fix For: 0.99.0, 1.0.0, 2.0.0, 0.98.7
>
>         Attachments: HBASE-11772-0.98.patch, HBASE-11772-master-v1.patch
>
>
> There are mvcc and seqId issues when bulk loading native hfiles, meaning 
> hfiles that are copied directly out of HBase rather than produced by an 
> HFileOutputFormat job.
> There are differences between these two types of hfiles.
> Native hfiles can have a non-zero MAX_MEMSTORE_TS_KEY value and non-zero 
> mvcc values in cells. 
> Native hfiles also have MAX_SEQ_ID_KEY.
> Native hfiles do not have BULKLOAD_TIME_KEY.
> Here are a couple of problems I observed when bulk loading native hfiles.
> 1.  Cells in newly bulk loaded hfiles can be invisible to scan.
> It is easy to re-create.
> Bulk load a native hfile that has a larger mvcc value in cells, e.g. 10.
> If the current readpoint when initiating a scan is less than 10, the cells in 
> the new hfile are skipped and thus become invisible.
> We don't reset the readpoint of a region after bulk load.
> 2. The current StoreFile.isBulkLoadResult() is implemented as:
> {code}
> return metadataMap.containsKey(BULKLOAD_TIME_KEY)
> {code}
> which does not detect bulkloaded native hfiles.
> 3. Another observed problem is possible data loss during log recovery. 
> It is similar to HBASE-10958 reported by [~jdcryans]. The re-create steps 
> below are borrowed from HBASE-10958.
> 1) Create an empty table
> 2) Put one row in it (let's say it gets seqid 1)
> 3) Bulk load one native hfile with large seqId ( e.g. 100). The native hfile 
> can be obtained by copying out from existing table.
> 4) Kill the region server that holds the table's region.
> Scan the table once the region is made available again. The first row, at 
> seqid 1, will be missing since the HFile with seqid 100 makes us believe that 
> everything that came before it was flushed. 
> Problem 3 is probably related to problem 2. We will be OK if we use the seqId 
> appended during bulk load instead of the 100 stored inside the file.
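
Problem 1 in the quoted description can be illustrated with a toy model. This is a sketch under stated assumptions, not HBase code: the Cell class and visibleRows() method are hypothetical, and the only rule modeled is the one described above, namely that a scan skips cells whose mvcc value is greater than the read point.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of read-point filtering; all names here are hypothetical.
final class ReadPointDemo {
    static final class Cell {
        final String row;
        final long mvcc;

        Cell(String row, long mvcc) {
            this.row = row;
            this.mvcc = mvcc;
        }
    }

    // A cell is visible to the scan only if its mvcc value does not
    // exceed the read point the scan was started with.
    static List<String> visibleRows(List<Cell> cells, long readPoint) {
        List<String> rows = new ArrayList<>();
        for (Cell cell : cells) {
            if (cell.mvcc <= readPoint) {
                rows.add(cell.row);
            }
        }
        return rows;
    }
}
```

In this model, a cell carrying mvcc 10 from a native hfile stays hidden from any scan started with a read point below 10, which matches the symptom described in problem 1.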



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
