[ 
https://issues.apache.org/jira/browse/HDFS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hairong Kuang updated HDFS-903:
-------------------------------

    Attachment: trunkChecksumImage3.patch

This patch makes the change in Checkpointer that Konstantin suggested; TestBackupNode 
actually caught this.

The patch also fixes a subtle bug in TestSaveNameSpace caused by using a spy. A 
spied object is only a shallow copy of the original object. So when a new 
checksum is generated while saving the image to disk, the new value is set in 
spyImage, but when the signature is saved into the VERSION file through 
StorageDirectory, the value set in originalImage is used instead, and reloading 
the image fails. I fixed it by explicitly setting the storage directories in 
spyImage.
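To illustrate the pitfall described above, here is a minimal sketch (the class
and field names are invented for the example, not the real FSImage/StorageDirectory
API): a spy-style shallow copy duplicates the original's fields at copy time, so a
value set afterwards on the copy is invisible to anything still holding the original.

```java
// Hypothetical stand-in for FSImage: holds a checksum field.
class Image {
    long checksum;
    Image() {}
    Image(Image other) {            // shallow copy, like a Mockito spy
        this.checksum = other.checksum;
    }
}

// Hypothetical stand-in for StorageDirectory: keeps a reference
// to the ORIGINAL image, not the copy.
class Storage {
    final Image image;
    Storage(Image image) { this.image = image; }
    long checksumForVersionFile() { return image.checksum; }
}

public class SpyPitfall {
    public static void main(String[] args) {
        Image original = new Image();
        Storage storage = new Storage(original);
        Image spy = new Image(original);   // shallow copy of original

        spy.checksum = 42L;                // new checksum set on the spy only

        // The VERSION file would be written with the stale value (0),
        // so a reload that recomputes the real checksum (42) fails.
        System.out.println(storage.checksumForVersionFile()); // prints 0
        System.out.println(spy.checksum);                     // prints 42
    }
}
```

This is why pointing the storage directories at the spy (rather than the original) resolves the mismatch.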

> NN should verify images and edit logs on startup
> ------------------------------------------------
>
>                 Key: HDFS-903
>                 URL: https://issues.apache.org/jira/browse/HDFS-903
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>            Reporter: Eli Collins
>            Assignee: Hairong Kuang
>            Priority: Critical
>             Fix For: 0.22.0
>
>         Attachments: trunkChecksumImage.patch, trunkChecksumImage1.patch, 
> trunkChecksumImage2.patch, trunkChecksumImage3.patch
>
>
> I was playing around with corrupting the fsimage and edits logs when there are 
> multiple dfs.name.dirs specified. I noticed that:
> * As long as the corruption does not make the image invalid (e.g., by changing 
> an opcode to an invalid one), HDFS doesn't notice and happily uses the corrupt 
> image or applies the corrupt edit.
> * If the first image in dfs.name.dir is "valid", it replaces the copies in the 
> other name.dirs with this first image, even if they differ. So if the first 
> image actually holds invalid/old/corrupt metadata, then you've lost your valid 
> metadata, which can result in data loss if the namenode garbage collects blocks 
> that it thinks are no longer used.
> How about we maintain a checksum as part of the image and edit log, check them 
> on startup, and refuse to start if they differ? Or at least provide a 
> configuration option to do so, for people worried about the overhead of 
> maintaining checksums of these files. Even if we assume dfs.name.dir is 
> reliable storage, this guards against operator error.
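The verification idea proposed above could be sketched as follows. This is a
minimal illustration, not the actual HDFS implementation: the class name,
method names, and the use of MD5 over the whole file are assumptions for the
example; the real patch integrates with the image format and VERSION file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksummedFile {

    // Compute the MD5 digest of a file's contents.
    static byte[] digestOf(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md5)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* digest is updated as we read */ }
        }
        return md5.digest();
    }

    // Return the file contents only if they still match the digest
    // recorded when the file was written; otherwise refuse to load.
    static byte[] loadVerified(Path file, byte[] savedDigest) throws Exception {
        if (!MessageDigest.isEqual(digestOf(file), savedDigest)) {
            throw new IOException("checksum mismatch: " + file + " is corrupt");
        }
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws Exception {
        Path image = Files.createTempFile("fsimage", ".bin");
        Files.write(image, "namespace data".getBytes());
        byte[] saved = digestOf(image);    // stored separately, e.g. in VERSION

        loadVerified(image, saved);        // loads fine

        Files.write(image, "corrupted!!".getBytes()); // simulate corruption
        try {
            loadVerified(image, saved);
        } catch (IOException expected) {
            System.out.println("refused to load corrupt image");
        }
    }
}
```

Keeping the digest outside the data file (as the VERSION file does for the signature) means a corrupted copy in one name.dir can be detected and skipped instead of silently overwriting the good copies.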

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
