[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574087#comment-14574087 ]
stack commented on HBASE-13811:
-------------------------------

[~Apache9] JIRA was down so it took a while to respond....

bq. Fine, I think it will work. But I still feel a little nervous to have two methods which have same name but different behaviors...

Makes sense. In this v7 patch, I made the two overloaded methods work the same and changed what happens in HRegion when we prepare to flush.

bq. And I remember that, when implementing HBASE-10201 and HBASE-12405, actually I wanted to return the flushedSeqId when calling startCacheFlush first. But there are two problems. First is getNextSequenceId method is in HRegion, not in FSHLog, so a simple solution is return NO_SEQ_NUM when flushing all stores and let HRegion call getNextSequenceId.

Yes. That is how it 'works' in patch v6 but it is hard to read. We can actually tell when we are flushing if we should do all of the region, right? If the passed in families are null or equal in number to region stores, we are doing a full region flush so we should use the flush sequence id, the result of the getNextSequenceId call. Otherwise, we want the getEarliest for the region because we are doing a column family only flush...

bq. But here comes the second problem, startCacheFlush may fail which means we can not start a flush, so there are three types of return values, 'sequenceId', 'choose a sequenceId by yourself', 'give up flushing!'. I think it is ugly to have a '-2' or a null java.lang.Long to indicate a 'give up flushing' at that time so I gave up...

Pardon me, I don't see the problem here? Your nice TestSplitWalDataLoss test was failing for me earlier because I was not doing the abort accounting properly; the 'restore' of old sequenceids. Abort of the flush will 'restore' the old sequenceids. The region flush id won't be updated. This is ok?

bq. Maybe we could consider this solution again? getEarliestMemstoreSeqNum can be used everywhere but startCacheFlush is restricted in the flushing scope I think.

I'd like to purge getEarliestMemstoreSeqNum or narrow its usage if possible. What do you mean by 'startCacheFlush is restricted'?

Thanks Duo

> Splitting WALs, we are filtering out too many edits -> DATALOSS
> ---------------------------------------------------------------
>
>                 Key: HBASE-13811
>                 URL: https://issues.apache.org/jira/browse/HBASE-13811
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 2.0.0, 1.2.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0, 1.2.0
>
>         Attachments: 13811.branch-1.txt, 13811.branch-1.txt, 13811.txt,
> 13811.v2.branch-1.txt, 13811.v3.branch-1.txt, 13811.v3.branch-1.txt,
> 13811.v4.branch-1.txt, 13811.v5.branch-1.txt, 13811.v6.branch-1.txt,
> 13811.v6.branch-1.txt, 13811.v7.branch-1.txt, HBASE-13811-v1.testcase.patch,
> HBASE-13811.testcase.patch
>
>
> I've been running ITBLLs against branch-1 around HBASE-13616 (move of
> ServerShutdownHandler to pv2). I have come across an instance of dataloss. My
> patch for HBASE-13616 was in place so can only think it the cause (but cannot
> see how). When we split the logs, we are skipping legit edits. Digging.
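
The full-versus-partial flush rule described in the comment (null families, or as many families as the region has stores, means a full region flush that takes the next sequence id; anything else is a column-family-only flush that takes the earliest unflushed id) might be sketched roughly as below. This is a hypothetical illustration, not code from the v7 patch: the class FlushSeqIdSketch, its method names, and the two suppliers (stand-ins for the getNextSequenceId call and the "getEarliest" lookup) are all assumptions made for the example.

```java
import java.util.Collection;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the rule discussed above; not the HBase implementation. */
final class FlushSeqIdSketch {

  private FlushSeqIdSketch() {}

  /**
   * A flush covers the whole region when no families were named,
   * or when as many families were named as the region has stores.
   */
  static boolean isFullRegionFlush(Collection<String> familiesToFlush, int storeCount) {
    return familiesToFlush == null || familiesToFlush.size() == storeCount;
  }

  /**
   * Picks the sequence id to record for a flush.
   *
   * @param nextSequenceId     assumed stand-in for the getNextSequenceId call
   * @param earliestUnflushed  assumed stand-in for the region's earliest
   *                           unflushed sequence id lookup
   */
  static long chooseFlushSeqId(Collection<String> familiesToFlush, int storeCount,
      LongSupplier nextSequenceId, LongSupplier earliestUnflushed) {
    if (isFullRegionFlush(familiesToFlush, storeCount)) {
      // Full region flush: everything up to a fresh sequence id is persisted.
      return nextSequenceId.getAsLong();
    }
    // Column-family-only flush: other stores may still hold older edits,
    // so only the earliest unflushed id is safe to report.
    return earliestUnflushed.getAsLong();
  }
}
```

Suppliers are used rather than plain long arguments so that the "next sequence id" source is only consulted on the full-flush path; whether that matters depends on how the real call behaves.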