[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits -> DATALOSS

stack (JIRA) Fri, 05 Jun 2015 00:51:50 -0700

    [ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574087#comment-14574087 ]


stack commented on HBASE-13811:
-------------------------------

[~Apache9] JIRA was down so it took a while to respond....


bq. Fine, I think it will work. But I still feel a little nervous about having 
two methods with the same name but different behaviors...

Makes sense. In this v7 patch, I made the two overloaded methods work the same 
and changed what happens in HRegion when we prepare to flush.

bq. And I remember that, when implementing HBASE-10201 and HBASE-12405, I 
actually wanted to return the flushedSeqId when calling startCacheFlush first. 
But there are two problems. First, the getNextSequenceId method is in HRegion, 
not in FSHLog, so a simple solution is to return NO_SEQ_NUM when flushing all 
stores and let HRegion call getNextSequenceId.

Yes. That is how it 'works' in patch v6 but it is hard to read. We can actually 
tell when we are flushing if we should do all of the region, right? If the 
passed in families are null or equal in number to region stores, we are doing a 
full region flush so we should use the flush sequence id, the result of the 
getNextSequenceId call. Otherwise, we want the getEarliest for the region 
because we are doing a column-family-only flush...
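
To make the rule above concrete, here is a minimal sketch of the decision. It is 
not the patch code: the class, the method names, and the plain long parameters 
standing in for the getNextSequenceId result and the region's earliest unflushed 
sequence id are all made up for illustration.

```java
import java.util.Collection;

// Hypothetical illustration only; none of these names come from the patch.
class FlushSeqIdSketch {

    /**
     * A flush covers the whole region when no families were passed in,
     * or when as many families were passed in as the region has stores.
     */
    static boolean isFullRegionFlush(Collection<String> familiesToFlush, int storeCount) {
        return familiesToFlush == null || familiesToFlush.size() == storeCount;
    }

    /**
     * Picks the sequence id to associate with the flush.
     *
     * @param nextSequenceId    stand-in for the result of getNextSequenceId
     * @param earliestForRegion stand-in for the region's earliest unflushed id
     */
    static long chooseFlushSeqId(Collection<String> familiesToFlush, int storeCount,
                                 long nextSequenceId, long earliestForRegion) {
        if (isFullRegionFlush(familiesToFlush, storeCount)) {
            // Full region flush: use the flush sequence id.
            return nextSequenceId;
        }
        // Column-family-only flush: fall back to the earliest id for the region.
        return earliestForRegion;
    }
}
```

With this shape the caller in HRegion decides up front which id applies, rather 
than decoding a NO_SEQ_NUM return value.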


bq. But here comes the second problem, startCacheFlush may fail which means we 
can not start a flush, so there are three types of return values, 'sequenceId', 
'choose a sequenceId by yourself', 'give up flushing!'. I think it is ugly to 
have a '-2' or a null java.lang.Long to indicate a 'give up flushing' at that 
time so I gave up...

Pardon me, I don't see the problem here? Your nice TestSplitWalDataLoss test 
was failing for me earlier because I was not doing the abort accounting 
properly; the 'restore' of old sequenceids. Abort of the flush will 'restore' 
the old sequenceids. The region flush id won't be updated. This is ok?
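
The abort accounting I mean looks roughly like the toy model below. This is a 
sketch under assumptions, not what is in the patch: the class, its fields, and 
the String family keys are hypothetical; only the startCacheFlush, 
completeCacheFlush and abortCacheFlush names echo the WAL methods under 
discussion.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of per-family sequence id accounting around a flush.
class FlushAccountingSketch {
    // Lowest sequence id of an edit not yet flushed, per column family.
    private final Map<String, Long> lowestUnflushed = new HashMap<>();
    // Ids moved aside while a flush of those families is in progress.
    private final Map<String, Long> flushing = new HashMap<>();
    // Region flush id; only a completed flush may change it.
    private long regionFlushedSeqId = -1L;

    void append(String family, long seqId) {
        // Only the first (lowest) unflushed id per family is remembered.
        lowestUnflushed.putIfAbsent(family, seqId);
    }

    void startCacheFlush(Set<String> families) {
        for (String family : families) {
            Long id = lowestUnflushed.remove(family);
            if (id != null) {
                flushing.put(family, id);
            }
        }
    }

    void completeCacheFlush(long flushedSeqId) {
        flushing.clear();
        regionFlushedSeqId = flushedSeqId;
    }

    void abortCacheFlush() {
        // 'Restore' the old ids; keep the older one if edits arrived meanwhile.
        for (Map.Entry<String, Long> e : flushing.entrySet()) {
            lowestUnflushed.merge(e.getKey(), e.getValue(), (a, b) -> Math.min(a, b));
        }
        flushing.clear();
        // regionFlushedSeqId is deliberately left untouched.
    }

    Long lowestUnflushedSeqId(String family) {
        return lowestUnflushed.get(family);
    }

    long regionFlushedSeqId() {
        return regionFlushedSeqId;
    }
}
```

After an abort the old ids are back in place, so WAL splitting would not treat 
the unflushed edits as already persisted.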

bq. Maybe we could consider this solution again? getEarliestMemstoreSeqNum can 
be used everywhere but startCacheFlush is restricted in the flushing scope I 
think.

I'd like to purge getEarliestMemstoreSeqNum or narrow its usage if possible. 
What do you mean by 'startCacheFlush is restricted'?

Thanks Duo

> Splitting WALs, we are filtering out too many edits -> DATALOSS
> ---------------------------------------------------------------
>
>                 Key: HBASE-13811
>                 URL: https://issues.apache.org/jira/browse/HBASE-13811
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 2.0.0, 1.2.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0, 1.2.0
>
>         Attachments: 13811.branch-1.txt, 13811.branch-1.txt, 13811.txt, 
> 13811.v2.branch-1.txt, 13811.v3.branch-1.txt, 13811.v3.branch-1.txt, 
> 13811.v4.branch-1.txt, 13811.v5.branch-1.txt, 13811.v6.branch-1.txt, 
> 13811.v6.branch-1.txt, 13811.v7.branch-1.txt, HBASE-13811-v1.testcase.patch, 
> HBASE-13811.testcase.patch
>
>
> I've been running ITBLLs against branch-1 around HBASE-13616 (move of 
> ServerShutdownHandler to pv2). I have come across an instance of dataloss. My 
> patch for HBASE-13616 was in place so I can only think it is the cause (but 
> cannot see how). When we split the logs, we are skipping legit edits. Digging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
