[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119446#comment-14119446 ]
Hudson commented on HBASE-11868:
--------------------------------
FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #465 (See
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/465/])
Revert "HBASE-11868 Data loss in hlog when the hdfs is unavailable (Liu
Shaohui)" (apurtell: rev ee32706c5d93fb3de6f4aba09174d34ca3879f6d)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> Data loss in hlog when the hdfs is unavailable
> ----------------------------------------------
>
> Key: HBASE-11868
> URL: https://issues.apache.org/jira/browse/HBASE-11868
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.5
> Reporter: Liu Shaohui
> Assignee: Liu Shaohui
> Priority: Blocker
> Fix For: 0.98.6
>
> Attachments: HBASE-11868-0.98-v1.diff, HBASE-11868-0.98-v2.diff
>
>
> While using the new thread model in HBase 0.98, we found a bug that may cause
> data loss when HDFS is unavailable.
> When doMiniBatchMutation in HRegion writes WAL edits to the hlog, it first
> calls appendNoSync to write the edits to the hlog and then calls sync with
> the txid.
> Assume that the txid of the current write is 10, syncedTillHere in the hlog
> is 9, and failedTxid is 0. When HDFS is unavailable, the AsyncWriter or
> AsyncSyncer fails to append or sync the edits, and then updates both
> syncedTillHere and failedTxid to 10.
> When the hlog then calls sync with txid 10, failedTxid is never checked,
> because txid equals syncedTillHere and the body of the while loop is
> skipped. The client believes the write succeeded, but the data was written
> only to the memstore, not to the hlog. If the regionserver goes down before
> the memstore is flushed, the data is lost.
> See: FSHLog.java #1348
> {code}
> // sync all transactions upto the specified txid
> private void syncer(long txid) throws IOException {
>   synchronized (this.syncedTillHere) {
>     while (this.syncedTillHere.get() < txid) {
>       try {
>         this.syncedTillHere.wait();
>         if (txid <= this.failedTxid.get()) {
>           assert asyncIOE != null :
>             "current txid is among(under) failed txids, but asyncIOE is null!";
>           throw asyncIOE;
>         }
>       } catch (InterruptedException e) {
>         LOG.debug("interrupted while waiting for notification from AsyncNotifier");
>       }
>     }
>   }
> }
> {code}
> We can fix this issue by moving the comparison of txid and failedTxid
> outside the while block, so that it also runs when the loop is skipped.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)