[ https://issues.apache.org/jira/browse/HBASE-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13567496#comment-13567496 ]

Anoop Sam John commented on HBASE-7728:
---------------------------------------

The LogRoller thread is rolling over the current log file. It has already acquired the 
updateLock.
{code}
HLog#rollWriter(boolean force) {
  synchronized (updateLock) {
    // Clean up current writer.
    Path oldFile = cleanupCurrentWriter(currentFilenum);
    this.writer = nextWriter;
    ...
  }
}
{code}
As part of cleaning up the current writer, this thread tries to sync the pending 
writes:
{code}
HLog#cleanupCurrentWriter() {
  ...
    sync();
  }
  this.writer.close();
}
{code}
At the same time the logSyncer thread was doing a deferred log sync operation:
{code}
HLog#syncer(long txid) {
  ...
  synchronized (flushLock) {
    ...
    try {
      logSyncerThread.hlogFlush(tempWriter, pending);
    } catch (IOException io) {
      synchronized (this.updateLock) {
        // HBASE-4387, HBASE-5623, retry with updateLock held
        tempWriter = this.writer;
        logSyncerThread.hlogFlush(tempWriter, pending);
      }
    }
  }
}
{code}
This thread is holding the flushLock and trying to grab the updateLock. At the same 
time the roller thread, already holding the updateLock, comes along and, as part of 
the cleanup sync, tries to grab the flushLock.
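
To make the lock ordering explicit, here is a minimal, self-contained sketch (illustrative only, not the actual HLog code): the roller path takes updateLock then flushLock, while the syncer's retry path takes flushLock then updateLock, so each thread can end up waiting for the lock the other holds.
{code}
// Illustrative sketch of the lock ordering, not actual HBase code.
public class LockOrderDeadlock {
  private final Object updateLock = new Object();
  private final Object flushLock = new Object();

  // Roller path: updateLock -> flushLock
  // (rollWriter -> cleanupCurrentWriter -> sync)
  void roll() {
    synchronized (updateLock) {
      synchronized (flushLock) {
        // flush pending writes, close old writer
      }
    }
  }

  // Syncer retry path: flushLock -> updateLock
  // (syncer -> hlogFlush throws IOE -> retry with updateLock held)
  void syncRetry() {
    synchronized (flushLock) {
      synchronized (updateLock) {
        // retry the flush against this.writer
      }
    }
  }

  public static void main(String[] args) {
    LockOrderDeadlock d = new LockOrderDeadlock();
    new Thread(d::roll, "logRoller").start();
    new Thread(d::syncRetry, "logSyncer").start();
    // If each thread grabs its first lock before the other grabs its second,
    // both block forever: a classic lock-ordering deadlock.
  }
}
{code}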
An IOException might have happened in the logSyncer thread (in 
logSyncerThread.hlogFlush). At that point our assumption is that a log rollover has 
already happened; that is why we retry the write with the updateLock held, fetching 
the writer again. [The writer on which the IOE happened should have been closed.]

In the roller thread, the writer close happens after the cleanup operation.
So I guess the IOE thrown by logSyncerThread.hlogFlush is not because of a log roll.
Instead of assuming a log roll in the catch block, can we check whether 
tempWriter == this.writer?
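
A rough sketch of that idea, applied to the catch block in the syncer excerpt above (hypothetical and untested, only to illustrate the suggestion):
{code}
// Hypothetical sketch of the suggested check; not a tested patch.
try {
  logSyncerThread.hlogFlush(tempWriter, pending);
} catch (IOException io) {
  if (tempWriter != this.writer) {
    // The writer was replaced, so a roll did happen; retry with updateLock held.
    synchronized (this.updateLock) {
      tempWriter = this.writer;
      logSyncerThread.hlogFlush(tempWriter, pending);
    }
  } else {
    // Writer unchanged: the IOE was not caused by a roll, so do not block
    // on updateLock (which the roller may be holding); propagate instead.
    throw io;
  }
}
{code}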

I am not an expert in this area; I am just adding my observations from a quick code 
study. Please correct me if I am wrong. Do you have any logs from when this happened?
                
> deadlock occurs between hlog roller and hlog syncer
> ---------------------------------------------------
>
>                 Key: HBASE-7728
>                 URL: https://issues.apache.org/jira/browse/HBASE-7728
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 0.94.2
>         Environment: Linux 2.6.18-164.el5 x86_64 GNU/Linux
>            Reporter: Wang Qiang
>            Priority: Blocker
>
> the hlog roller thread and the hlog syncer thread may deadlock on the 
> 'flushLock' and 'updateLock', and then cause all 'IPC Server handler' threads 
> to block on hlog append. The jstack info is as follows:
> "regionserver60020.logRoller":
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1305)
>         - waiting to lock <0x000000067bf88d58> (a java.lang.Object)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1283)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1456)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLog.cleanupCurrentWriter(HLog.java:876)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:657)
>         - locked <0x000000067d54ace0> (a java.lang.Object)
>         at 
> org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
>         at java.lang.Thread.run(Thread.java:662)
> "regionserver60020.logSyncer":
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1314)
>         - waiting to lock <0x000000067d54ace0> (a java.lang.Object)
>         - locked <0x000000067bf88d58> (a java.lang.Object)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1283)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1456)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:1235)
>         at java.lang.Thread.run(Thread.java:662)
