[
https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Liu Shaohui updated HBASE-11868:
--------------------------------
Status: Patch Available (was: Open)
> Data loss in hlog when the hdfs is unavailable
> ----------------------------------------------
>
> Key: HBASE-11868
> URL: https://issues.apache.org/jira/browse/HBASE-11868
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.5
> Reporter: Liu Shaohui
> Assignee: Liu Shaohui
> Priority: Blocker
> Attachments: HBASE-11868-0.98-v1.diff
>
>
> When using the new thread model in HBase, we found a bug that may cause data
> loss when HDFS is unavailable.
> When writing WAL edits to the hlog in doMiniBatchMutation of HRegion, the
> region first calls appendNoSync to write the edits to the hlog and then calls
> sync with the txid of that append.
> Assume that the txid of the current write is 10, syncedTillHere in the hlog
> is 9, and failedTxid is 0. When HDFS is unavailable, the AsyncWriter or
> AsyncSyncer fails to append or sync the edits, and then updates
> syncedTillHere to 10 and failedTxid to 10.
> When the hlog then calls sync with txid 10, failedTxid is never checked,
> because txid is no longer greater than syncedTillHere and the while loop in
> syncer is skipped entirely. The client thinks the write succeeded, but the
> data was only written to the memstore, not to the hlog. If the regionserver
> goes down before the memstore is flushed, the data will be lost.
> {code}
> // sync all transactions upto the specified txid
> private void syncer(long txid) throws IOException {
>   synchronized (this.syncedTillHere) {
>     while (this.syncedTillHere.get() < txid) {
>       try {
>         this.syncedTillHere.wait();
>         if (txid <= this.failedTxid.get()) {
>           assert asyncIOE != null :
>             "current txid is among(under) failed txids, but asyncIOE is null!";
>           throw asyncIOE;
>         }
>       } catch (InterruptedException e) {
>         LOG.debug("interrupted while waiting for notification from AsyncNotifier");
>       }
>     }
>   }
> }
> {code}
> We can fix this issue by moving the comparison of txid and failedTxid outside
> the while block, so that it also runs when the loop body is never entered.
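
The scenario and the proposed fix can be illustrated with a minimal, self-contained sketch. This is not the HBase code or the attached patch: the class `SyncCheckSketch` and the helpers `recordFailure` and `reportsFailure` are hypothetical names introduced for illustration, and only `syncedTillHere`, `failedTxid`, and `asyncIOE` mirror fields from the snippet above. The sketch also assumes an interrupt should abort the wait, which differs from the original, where the interrupt is only logged.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified model of the sync bookkeeping described in this issue.
class SyncCheckSketch {
    final AtomicLong syncedTillHere = new AtomicLong(0);
    final AtomicLong failedTxid = new AtomicLong(0);
    volatile IOException asyncIOE = null;

    // Models the failure path from the description: the async thread advances
    // syncedTillHere and failedTxid to the failed txid and wakes up waiters.
    void recordFailure(long txid, IOException cause) {
        synchronized (syncedTillHere) {
            asyncIOE = cause;
            failedTxid.set(txid);
            syncedTillHere.set(txid);
            syncedTillHere.notifyAll();
        }
    }

    // Same shape as the quoted syncer: the failure check sits inside the loop,
    // so it is skipped when syncedTillHere has already reached txid.
    void syncerBuggy(long txid) throws IOException {
        synchronized (syncedTillHere) {
            while (syncedTillHere.get() < txid) {
                try {
                    syncedTillHere.wait();
                    if (txid <= failedTxid.get()) {
                        throw asyncIOE;
                    }
                } catch (InterruptedException e) {
                    // Assumption of this sketch: abort instead of logging and retrying.
                    throw new InterruptedIOException("interrupted while waiting for sync");
                }
            }
        }
    }

    // Proposed shape: wait first, then always compare txid with failedTxid.
    void syncerFixed(long txid) throws IOException {
        synchronized (syncedTillHere) {
            while (syncedTillHere.get() < txid) {
                try {
                    syncedTillHere.wait();
                } catch (InterruptedException e) {
                    throw new InterruptedIOException("interrupted while waiting for sync");
                }
            }
            if (txid <= failedTxid.get()) {
                throw asyncIOE;
            }
        }
    }

    // Replays the scenario: syncedTillHere is 9, the write with txid 10 fails,
    // then sync(10) is called. Returns true if the caller sees the failure.
    static boolean reportsFailure(boolean useFixed) {
        SyncCheckSketch log = new SyncCheckSketch();
        log.syncedTillHere.set(9);
        log.recordFailure(10, new IOException("hdfs unavailable"));
        try {
            if (useFixed) {
                log.syncerFixed(10);
            } else {
                log.syncerBuggy(10);
            }
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("buggy syncer reports failure: " + reportsFailure(false));
        System.out.println("fixed syncer reports failure: " + reportsFailure(true));
    }
}
```

In this model the failure is recorded before the sync call, so neither variant ever waits. The buggy variant returns normally and the caller would treat the write as durable, while the fixed variant surfaces the stored IOException.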