[ 
https://issues.apache.org/jira/browse/HBASE-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119502#comment-14119502
 ] 

Hudson commented on HBASE-11868:
--------------------------------

FAILURE: Integrated in HBase-0.98 #493 (See 
[https://builds.apache.org/job/HBase-0.98/493/])
HBASE-11868 Data loss in hlog when the hdfs is unavailable (Liu Shaohui) 
(apurtell: rev 39771b8f73a6e6eae12e8b3bdb7dd1fe13edc83c)
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java


> Data loss in hlog when the hdfs is unavailable
> ----------------------------------------------
>
>                 Key: HBASE-11868
>                 URL: https://issues.apache.org/jira/browse/HBASE-11868
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.5
>            Reporter: Liu Shaohui
>            Assignee: Liu Shaohui
>            Priority: Blocker
>             Fix For: 0.98.6
>
>         Attachments: HBASE-11868-0.98-v1.diff, HBASE-11868-0.98-v2.diff
>
>
> When using the new thread model in HBase 0.98, we found a bug that may cause 
> data loss when HDFS is unavailable.
> When writing WAL edits to the hlog in doMiniBatchMutation of HRegion, the 
> code first calls appendNoSync to write the edits to the hlog and then calls 
> sync with the txid. 
> Assume that the txid of the current write is 10, syncedTillHere in the hlog 
> is 9, and failedTxid is 0. When HDFS is unavailable, the AsyncWriter or 
> AsyncSyncer fails to append or sync the edits, and then updates 
> syncedTillHere to 10 and failedTxid to 10.
> When sync is then called with txid 10, failedTxid is never checked, because 
> txid equals syncedTillHere and the while loop is skipped. The client thinks 
> the write succeeded, but the data was written only to the memstore, not to 
> the hlog. If the regionserver goes down before the memstore is flushed, the 
> data is lost.
> See: FSHLog.java #1348
> {code}
>   // sync all transactions upto the specified txid
>   private void syncer(long txid) throws IOException {
>     synchronized (this.syncedTillHere) {
>       while (this.syncedTillHere.get() < txid) {
>         try {
>           this.syncedTillHere.wait();
>           if (txid <= this.failedTxid.get()) {
>             assert asyncIOE != null :
>               "current txid is among(under) failed txids, but asyncIOE is null!";
>             throw asyncIOE;
>           }
>         } catch (InterruptedException e) {
>           LOG.debug("interrupted while waiting for notification from AsyncNotifier");
>         }
>       }
>     }
>   }
> {code}
> We can fix this issue by moving the comparison of txid and failedTxid outside 
> the while block, so that it runs even when the loop body is never entered.
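
The scenario and the proposed fix can be illustrated with a minimal, self-contained sketch. This is not the FSHLog code or the committed patch: the class SyncerSketch and the methods markSynced, markFailed, syncerOriginal, syncerFixed and awaitNotification are hypothetical names introduced only for illustration. The field names syncedTillHere, failedTxid and asyncIOE follow the snippet above. As a simplification, the sketch turns an interrupt into an InterruptedIOException instead of logging and continuing to wait.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the sync bookkeeping described above; not the real FSHLog.
class SyncerSketch {
    private final AtomicLong syncedTillHere = new AtomicLong(0);
    private final AtomicLong failedTxid = new AtomicLong(0);
    private volatile IOException asyncIOE;

    // Simulates a successful background sync up to txid.
    void markSynced(long txid) {
        synchronized (syncedTillHere) {
            syncedTillHere.set(txid);
            syncedTillHere.notifyAll();
        }
    }

    // Simulates the failure path from the description: both counters advance to txid.
    void markFailed(long txid, IOException cause) {
        synchronized (syncedTillHere) {
            asyncIOE = cause;
            failedTxid.set(txid);
            syncedTillHere.set(txid);
            syncedTillHere.notifyAll();
        }
    }

    // Original shape: the failure check sits inside the loop, so it is skipped
    // whenever syncedTillHere has already reached txid.
    void syncerOriginal(long txid) throws IOException {
        synchronized (syncedTillHere) {
            while (syncedTillHere.get() < txid) {
                awaitNotification();
                if (txid <= failedTxid.get()) {
                    throw asyncIOE;
                }
            }
        }
    }

    // Proposed shape: wait first, then always compare txid with failedTxid.
    void syncerFixed(long txid) throws IOException {
        synchronized (syncedTillHere) {
            while (syncedTillHere.get() < txid) {
                awaitNotification();
            }
            if (txid <= failedTxid.get()) {
                throw asyncIOE;
            }
        }
    }

    // Must be called while holding the syncedTillHere monitor.
    // Simplification: an interrupt aborts the wait instead of being logged.
    private void awaitNotification() throws IOException {
        try {
            syncedTillHere.wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new InterruptedIOException("interrupted while waiting for sync");
        }
    }
}
```

With syncedTillHere at 9, a call to markFailed(10, ...) followed by syncerOriginal(10) returns normally, which is the silent data-loss case from the description. The same sequence with syncerFixed(10) rethrows the stored IOException, so the caller learns that the write did not reach the log.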



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
