[ https://issues.apache.org/jira/browse/HBASE-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13567496#comment-13567496 ]
Anoop Sam John commented on HBASE-7728:
---------------------------------------

The LogRoller thread is trying to roll over the current log file. It has already acquired the updateLock:
{code}
HLog#rollWriter(boolean force)
  synchronized (updateLock) {
    // Clean up current writer.
    Path oldFile = cleanupCurrentWriter(currentFilenum);
    this.writer = nextWriter;
    ....
  }
{code}
As part of cleaning up the current writer, this thread tries to sync the pending writes:
{code}
HLog#cleanupCurrentWriter() {
  ....
  sync();
  ....
  this.writer.close();
}
{code}
At the same time, the logSyncer thread was doing a deferred log sync operation:
{code}
HLog#syncer(long txid) {
  ...
  synchronized (flushLock) {
    ....
    try {
      logSyncerThread.hlogFlush(tempWriter, pending);
    } catch (IOException io) {
      synchronized (this.updateLock) {
        // HBASE-4387, HBASE-5623, retry with updateLock held
        tempWriter = this.writer;
        logSyncerThread.hlogFlush(tempWriter, pending);
      }
    }
  }
{code}
This thread is trying to grab the updateLock while holding the flushLock. Meanwhile the roller thread, as part of the cleanup sync, tries to grab the flushLock.

An IOException might have happened in the logSyncer thread (logSyncerThread.hlogFlush). The assumption in the catch block is that a log rollover has already happened; that is why we retry the write with the updateLock held, fetching the writer again. [The writer on which the IOE happened should have been closed.] But in the roller thread, the writer close happens after the cleanup operation. So I guess logSyncerThread.hlogFlush threw the IOE for some reason other than a log roll. Instead of assuming a log roll in the catch block, could we check for tempWriter == this.writer?

I am not an expert in this area; I am just adding my observations from a quick code study. If wrong, please correct me. Do you have any logs from when this happened?
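To illustrate the suggestion above, here is a minimal standalone sketch (not HBase code; the class, the tiny Writer interface, and the failure simulation are all hypothetical simplifications of HLog#syncer): on IOException, retry under updateLock only when the writer reference actually changed, i.e. a roll really happened, instead of assuming it did.

```java
import java.io.IOException;

public class SyncRetrySketch {
    // Minimal stand-in for the WAL writer used by HLog.
    interface Writer { void flush() throws IOException; }

    private final Object updateLock = new Object();
    private volatile Writer writer;

    SyncRetrySketch(Writer w) { this.writer = w; }

    void syncOnce(Writer tempWriter) throws IOException {
        try {
            tempWriter.flush();
        } catch (IOException io) {
            // Suggested guard: if the writer has not changed, no roll happened
            // while we were flushing, so retrying on the same (possibly broken)
            // writer under updateLock cannot help -- rethrow instead.
            if (tempWriter == this.writer) {
                throw io;
            }
            synchronized (updateLock) {
                this.writer.flush(); // retry on the post-roll writer
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A writer that always fails, standing in for a broken stream.
        SyncRetrySketch log =
            new SyncRetrySketch(() -> { throw new IOException("broken pipe"); });
        Writer stale = log.writer;
        // Simulate a log roll: a healthy writer replaces the broken one.
        log.writer = () -> System.out.println("retried on post-roll writer");
        // Flushing the stale writer fails, but the guard sees a roll happened
        // and retries on the new writer.
        log.syncOnce(stale);
    }
}
```

This is only a sketch of the identity check the comment proposes, not a claim about how the actual HLog retry path should be restructured.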
> deadlock occurs between hlog roller and hlog syncer
> ---------------------------------------------------
>
>                 Key: HBASE-7728
>                 URL: https://issues.apache.org/jira/browse/HBASE-7728
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.94.2
>        Environment: Linux 2.6.18-164.el5 x86_64 GNU/Linux
>           Reporter: Wang Qiang
>           Priority: Blocker
>
> The hlog roller thread and the hlog syncer thread may deadlock on the
> 'flushLock' and 'updateLock', which then causes all 'IPC Server handler'
> threads to block on hlog append. The jstack info is as follows:
> "regionserver60020.logRoller":
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1305)
>         - waiting to lock <0x000000067bf88d58> (a java.lang.Object)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1283)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1456)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.cleanupCurrentWriter(HLog.java:876)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:657)
>         - locked <0x000000067d54ace0> (a java.lang.Object)
>         at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
>         at java.lang.Thread.run(Thread.java:662)
> "regionserver60020.logSyncer":
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1314)
>         - waiting to lock <0x000000067d54ace0> (a java.lang.Object)
>         - locked <0x000000067bf88d58> (a java.lang.Object)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1283)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1456)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1235)
>         at java.lang.Thread.run(Thread.java:662)
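The two stack traces show a textbook lock-order inversion: the roller holds <0x...ace0> (updateLock) and waits on <0x...8d58> (flushLock), while the syncer holds <0x...8d58> and waits on <0x...ace0>. The following standalone sketch (not HBase code; lock names and the timeout are illustrative) reproduces the pattern with java.util.concurrent locks, substituting tryLock with a timeout for a plain lock() so the demo terminates instead of hanging:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class InvertedLockOrderDemo {
    static final ReentrantLock updateLock = new ReentrantLock(); // roller's first lock
    static final ReentrantLock flushLock  = new ReentrantLock(); // syncer's first lock
    static final CountDownLatch bothHeld  = new CountDownLatch(2);
    static final CountDownLatch attempted = new CountDownLatch(2);
    static final AtomicInteger stuck      = new AtomicInteger();

    // Each worker takes 'first' up front, then wants 'second' while holding it.
    static Runnable worker(ReentrantLock first, ReentrantLock second) {
        return () -> {
            first.lock();
            try {
                bothHeld.countDown();
                bothHeld.await();            // both threads now hold one lock each
                // tryLock with a timeout stands in for lock(); with a real
                // lock() both threads would block here forever.
                if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                    second.unlock();
                } else {
                    stuck.incrementAndGet(); // would have deadlocked here
                }
                attempted.countDown();
                attempted.await();           // hold 'first' until both have tried
            } catch (InterruptedException ignored) {
            } finally {
                first.unlock();
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // roller: rollWriter (updateLock) -> cleanup sync (flushLock)
        Thread roller = new Thread(worker(updateLock, flushLock));
        // syncer: syncer (flushLock) -> IOException retry (updateLock)
        Thread syncer = new Thread(worker(flushLock, updateLock));
        roller.start();
        syncer.start();
        roller.join();
        syncer.join();
        System.out.println("threads that would deadlock: " + stuck.get());
    }
}
```

The standard cure for this pattern is a single global lock order (or avoiding the second acquisition entirely, as the comment above suggests); which fix fits HLog is for the HBase devs to decide.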