[ https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119502#comment-14119502 ]
Hudson commented on HBASE-11868:
--------------------------------

FAILURE: Integrated in HBase-0.98 #493 (See [https://builds.apache.org/job/HBase-0.98/493/])
HBASE-11868 Data loss in hlog when the hdfs is unavailable (Liu Shaohui) (apurtell: rev 39771b8f73a6e6eae12e8b3bdb7dd1fe13edc83c)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java


> Data loss in hlog when the hdfs is unavailable
> ----------------------------------------------
>
>                 Key: HBASE-11868
>                 URL: https://issues.apache.org/jira/browse/HBASE-11868
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.5
>            Reporter: Liu Shaohui
>            Assignee: Liu Shaohui
>            Priority: Blocker
>             Fix For: 0.98.6
>
>         Attachments: HBASE-11868-0.98-v1.diff, HBASE-11868-0.98-v2.diff
>
>
> When using the new thread model in HBase 0.98, we found a bug that may cause
> data loss when HDFS is unavailable.
> When writing WAL edits to the hlog in doMiniBatchMutation of HRegion, HRegion
> first calls appendNoSync to write the edits to the hlog and then calls sync
> with the txid.
> Assume that the txid of the current write is 10, syncedTillHere in the hlog
> is 9, and failedTxid is 0. When HDFS is unavailable, the AsyncWriter or
> AsyncSyncer fails to append the edits or to sync, and then updates
> syncedTillHere to 10 and failedTxid to 10.
> When the hlog then calls sync with txid 10, failedTxid is never checked,
> because txid equals syncedTillHere and the body of the while loop is skipped.
> The client thinks the write succeeded, but the data has only been written to
> the memstore, not to the hlog. If the regionserver goes down before the
> memstore is flushed, the data is lost.
> See: FSHLog.java #1348
> {code}
>   // sync all transactions up to the specified txid
>   private void syncer(long txid) throws IOException {
>     synchronized (this.syncedTillHere) {
>       while (this.syncedTillHere.get() < txid) {
>         try {
>           this.syncedTillHere.wait();
>           if (txid <= this.failedTxid.get()) {
>             assert asyncIOE != null :
>               "current txid is among(under) failed txids, but asyncIOE is null!";
>             throw asyncIOE;
>           }
>         } catch (InterruptedException e) {
>           LOG.debug("interrupted while waiting for notification from AsyncNotifier");
>         }
>       }
>     }
>   }
> {code}
> We can fix this issue by moving the comparison of txid and failedTxid outside
> the while block.
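
The fix proposed in the description can be sketched as follows. This is a minimal, self-contained illustration and not the committed patch: the class SyncerSketch and the helpers markSynced and markFailed are hypothetical stand-ins for FSHLog and its AsyncWriter/AsyncSyncer threads, and throwing InterruptedIOException on interrupt is a simplification of the log-and-retry behavior in the quoted code. Only the names syncedTillHere, failedTxid, and asyncIOE come from the original.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, stripped-down stand-in for the sync bookkeeping in FSHLog.
class SyncerSketch {
    private final AtomicLong syncedTillHere = new AtomicLong(0);
    private final AtomicLong failedTxid = new AtomicLong(0);
    private volatile IOException asyncIOE = null;

    // Hypothetical helper: what a background thread does after a successful sync.
    void markSynced(long txid) {
        synchronized (syncedTillHere) {
            syncedTillHere.set(txid);
            syncedTillHere.notifyAll();
        }
    }

    // Hypothetical helper: what a background thread does when append or sync fails.
    // As in the description, both syncedTillHere and failedTxid advance to txid.
    void markFailed(long txid, IOException cause) {
        synchronized (syncedTillHere) {
            asyncIOE = cause;
            failedTxid.set(txid);
            syncedTillHere.set(txid);
            syncedTillHere.notifyAll();
        }
    }

    // Wait until everything up to txid has been handled, then check for failure.
    void syncer(long txid) throws IOException {
        synchronized (syncedTillHere) {
            while (syncedTillHere.get() < txid) {
                try {
                    syncedTillHere.wait();
                } catch (InterruptedException e) {
                    // Simplification: give up instead of logging and retrying.
                    Thread.currentThread().interrupt();
                    throw new InterruptedIOException("interrupted while waiting for sync");
                }
            }
            // The check now sits outside the loop, so it also runs when the
            // loop body is skipped because syncedTillHere already equals txid.
            if (txid <= failedTxid.get()) {
                IOException cause = asyncIOE;
                throw cause != null ? cause : new IOException("sync failed for txid " + txid);
            }
        }
    }

    public static void main(String[] args) {
        SyncerSketch log = new SyncerSketch();
        log.markSynced(9);
        log.markFailed(10, new IOException("hdfs unavailable"));
        try {
            log.syncer(10);
            System.out.println("sync reported success");
        } catch (IOException e) {
            System.out.println("sync failed: " + e.getMessage());
        }
    }
}
```

With the check inside the loop, as in the quoted code, the scenario in main would return normally from syncer(10) because the loop condition is already false on entry. With the check after the loop, the stored exception reaches the caller.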