Andrew Wang created HDFS-4596:
---------------------------------

             Summary: Shutting down namenode during checkpointing can lead to md5sum error
                 Key: HDFS-4596
                 URL: https://issues.apache.org/jira/browse/HDFS-4596
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.0.0
            Reporter: Andrew Wang
            Assignee: Andrew Wang
             Fix For: 2.0.4-alpha

This is a rare error that can occur if the NN is shut down during checkpointing. Checkpointing followed by a restart nominally looks like this:

# FSImage is written to a tmp file and then renamed
# MD5 file is written to a tmp file and then renamed
# NN is killed and restarted
# NN scans storage directories and picks up the renamed image file
# NN validates that the image file matches its MD5 file

If the NN is killed before step 2 completes, this is what happens instead:

# FSImage is written to a tmp file and then renamed
# NN is killed and restarted (no MD5 file!)
# NN scans storage directories and picks up the renamed image file
# Since there's no matching MD5 file, NN errors out with a checksum error

I think we can fix this either by inverting the write order (MD5 file first, then the image) or by inverting the read order (MD5 file first, then the image).
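
To make the ordering concrete, here is a minimal, self-contained sketch of the idea. It is not the actual NameNode code or the eventual patch. The class name {{CheckpointOrderSketch}}, its methods, the {{.md5}} sidecar naming, and the sidecar holding only a hex digest are all assumptions made for illustration. The writer publishes the MD5 file before renaming the image into place, and the loader treats an image without an MD5 file as an incomplete checkpoint to skip rather than as a checksum failure.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; none of these names come from HDFS.
class CheckpointOrderSketch {

    // Sidecar path for an image file (the ".md5" suffix is assumed here).
    static Path md5PathFor(Path image) {
        return image.resolveSibling(image.getFileName() + ".md5");
    }

    // Hex-encoded MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Write to a tmp file, then rename, so readers never see a partial file.
    static void atomicWrite(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, data);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Inverted write order: MD5 file first, image last. A crash between the
    // two steps leaves an orphan MD5 file but never an image without one.
    static void saveCheckpoint(Path image, byte[] contents) throws IOException {
        byte[] md5 = md5Hex(contents).getBytes(StandardCharsets.UTF_8);
        atomicWrite(md5PathFor(image), md5);
        atomicWrite(image, contents);
    }

    // Read side: only accept an image whose MD5 file exists and matches.
    static boolean isUsable(Path image) throws IOException {
        Path md5 = md5PathFor(image);
        if (!Files.exists(image) || !Files.exists(md5)) {
            return false; // incomplete checkpoint, skip it instead of failing
        }
        String expected =
            new String(Files.readAllBytes(md5), StandardCharsets.UTF_8).trim();
        return expected.equals(md5Hex(Files.readAllBytes(image)));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("checkpoint-sketch");

        // Simulate the crash from the report: image renamed, MD5 never written.
        Path orphan = dir.resolve("fsimage_1");
        Files.write(orphan, "state-1".getBytes(StandardCharsets.UTF_8));
        System.out.println("image without md5 usable? " + isUsable(orphan));

        // A checkpoint that completed with the inverted write order.
        Path complete = dir.resolve("fsimage_2");
        saveCheckpoint(complete, "state-2".getBytes(StandardCharsets.UTF_8));
        System.out.println("complete checkpoint usable? " + isUsable(complete));
    }
}
```

With this ordering, a crash between the two renames leaves a stray MD5 file and no image, which the scan simply ignores. Whether the real fix should change the write side, the read side, or both is still open.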