[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834096#action_12834096 ]
Todd Lipcon commented on HDFS-909: ---------------------------------- I thought about this a bit more deeply over dinner :) I think the potential issue is not in fact a deadlock, since waitForAllOngoingEdits is a bit misnamed. In fact, it just calls through to logSync, so in the "bad code" example above, what will happen is that we sync the first synchronized block's edit for it, and then enter safe mode before entering the second synchronized block. So, to be correct, code just needs to make sure that it always checks isInSafeMode() each time it enters synchronized (this) {...}, before making any edit. Dhruba, do you concur? Switching to the ReadWriteLock would be nice since we could enforce the safemode check as part of the lock acquisition. But that's definitely out of scope. So, I think this is a certain improvement with no deadlock potential. The error case described above is no worse than what currently happens. So, I think the patch is good, and we'll use HDFS-956 to fix the "always check safe mode after entering synchronized blocks" problems. > Race condition between rollEditLog or rollFSImage ant FSEditsLog.write > operations corrupts edits log > ----------------------------------------------------------------------------------------------------- > > Key: HDFS-909 > URL: https://issues.apache.org/jira/browse/HDFS-909 > Project: Hadoop HDFS > Issue Type: Bug > Components: name-node > Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0 > Environment: CentOS > Reporter: Cosmin Lehene > Assignee: Todd Lipcon > Priority: Blocker > Fix For: 0.21.0, 0.22.0 > > Attachments: hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, > hdfs-909.txt, hdfs-909.txt > > > closing the edits log file can race with write to edits log file operation > resulting in OP_INVALID end-of-file marker being initially overwritten by the > concurrent (in setReadyToFlush) threads and then removed twice from the > buffer, losing a good byte from edits log. > Example: > {code} > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> > FSEditLog.closeStream() -> EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > OR > FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() ->EditLogOutputStream.setReadyToFlush() > FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> > FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> > FSEditLog.closeStream() ->EditLogOutputStream.flush() -> > EditLogFileOutputStream.flushAndSync() > VERSUS > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.setReadyToFlush() > FSNameSystem.completeFile -> FSEditLog.logSync() -> > EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync() > OR > Any FSEditLog.write > {code} > Access on the edits flush operations is synchronized only in the > FSEdits.logSync() method level. However at a lower level access to > EditsLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT > synchronized. These can be called from concurrent threads like in the example > above > So if a rollEditLog or rollFSIMage is happening at the same time with a write > operation it can race for EditLogFileOutputStream.setReadyToFlush that will > overwrite the the last byte (normally the FSEditsLog.OP_INVALID which is the > "end-of-file marker") and then remove it twice (from each thread) in > flushAndSync()! Hence there will be a valid byte missing from the edits log > that leads to a SecondaryNameNode silent failure and a full HDFS failure upon > cluster restart. > We got to this point after investigating a corrupted edits file that made > HDFS unable to start with > {code:title=namenode.log} > java.io.IOException: Incorrect data format. logVersion is -20 but > writables.length is 768. > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450 > {code} > EDIT: moved the logs to a comment to make this readable -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.