[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed

Andrew Purtell (JIRA) Wed, 16 Apr 2014 20:39:06 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972269#comment-13972269
 ]


Andrew Purtell commented on HBASE-10958:
----------------------------------------

bq.  The requirement on 'CREATE' for bulk load seems to come from [ 
getTableDescriptors ]. Is this even intended?

Yes.

CREATE is overloaded to mean "RESTRICTED ADMIN", with ADMIN-ish privilege 
required because table schema is considered potentially sensitive. 

On another issue Francis Liu and I discussed the notion of creating a new 
permission 'SCHEMA' which would grant permission to read schema metadata. Now 
as then it seems maybe not quite needed (yet). CREATE and ADMIN would have such 
SCHEMA permission implicitly, so how useful would SCHEMA be, and there would 
still need a grant beyond WRITE for bulk loading.

> [dataloss] Bulk loading with seqids can prevent some log entries from being 
> replayed
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-10958
>                 URL: https://issues.apache.org/jira/browse/HBASE-10958
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.96.2, 0.98.1, 0.94.18
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.99.0, 0.94.19, 0.98.2, 0.96.3
>
>         Attachments: HBASE-10958-less-intrusive-hack-0.96.patch, 
> HBASE-10958-quick-hack-0.96.patch, HBASE-10958-v2.patch, HBASE-10958.patch
>
>
> We found an issue with bulk loads causing data loss when assigning sequence 
> ids (HBASE-6630) that is triggered when replaying recovered edits. We're 
> nicknaming this issue *Blindspot*.
> The problem is that the sequence id given to a bulk loaded file is higher 
> than those of the edits in the region's memstore. When replaying recovered 
> edits, the rule to skip some of them is that they have to be _lower than the 
> highest sequence id_. In other words, the edits that have a sequence id lower 
> than the highest one in the store files *should* have also been flushed. This 
> is not the case with bulk loaded files since we now have an HFile with a 
> sequence id higher than unflushed edits.
> The log recovery code takes this into account by simply skipping the bulk 
> loaded files, but this "bulk loaded status" is *lost* on compaction. The 
> edits in the logs that have a sequence id lower than the bulk loaded file 
> that got compacted are put in a blind spot and are skipped during replay.
> Here's the easiest way to recreate this issue:
>  - Create an empty table
>  - Put one row in it (let's say it gets seqid 1)
>  - Bulk load one file (it gets seqid 2). I used ImporTsv and set 
> hbase.mapreduce.bulkload.assign.sequenceNumbers.
>  - Bulk load a second file the same way (it gets seqid 3).
>  - Major compact the table (the new file has seqid 3 and isn't considered 
> bulk loaded).
>  - Kill the region server that holds the table's region.
>  - Scan the table once the region is made available again. The first row, at 
> seqid 1, will be missing since the HFile with seqid 3 makes us believe that 
> everything that came before it was flushed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed

Reply via email to