[ 
https://issues.apache.org/jira/browse/HDFS-13031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575833#comment-16575833
 ] 

Yongjun Zhang commented on HDFS-13031:
--------------------------------------

Thanks [~adam.antal] and [~smeng].

Good summary!

The OIV tool may do things differently from the NN itself, so using the NN to 
load the fsimage is the only truly complete check of the fsimage (this is what 
I proposed in this jira). But I agree that, if feasible, adding --verify to OIV 
could detect the problems we have seen so far. Or we could even call it 
--detectcorruption.

That said, some action (quitting the SNN, etc.) needs to be taken after 
detecting fsimage corruption. I think HDFS-13314 and HDFS-13813 are good 
complementary solutions.
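
For illustration, the detect-then-act flow from the description could be 
sketched roughly as below. This is only a sketch under assumptions: the class, 
enum and method names are made up, the verifier command is a placeholder for 
whatever we end up with (a modified NN or an OIV option), and none of this is 
an existing Hadoop API. Treating a timeout as a failed check is also just an 
assumption.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not part of Hadoop.
class FsImageCheckSketch {

    /** What the checkpointer does after the verification run (made-up names). */
    enum Action { KEEP_CHECKPOINTING, BACKUP_IMAGE_AND_EDITS, STOP_CHECKPOINTING }

    /**
     * Maps the verification result to an action. onFailure selects between
     * option a) back up fsimage + editlogs and option b) stop checkpointing.
     */
    static Action decide(boolean loadedCleanly, Action onFailure) {
        return loadedCleanly ? Action.KEEP_CHECKPOINTING : onFailure;
    }

    /**
     * Runs the verifier as a separate process so that a crash while loading
     * the image cannot take down the checkpointer itself. Returns true only
     * if the process exits with status 0 within the timeout.
     */
    static boolean runVerifier(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).inheritIO().start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // Assumption: a verifier that hangs is treated as a failed check.
            p.destroyForcibly();
            return false;
        }
        return p.exitValue() == 0;
    }
}
```

The caller would pass the result of runVerifier(...) into decide(...) right 
after each checkpoint, and raise a flag to the user whenever the result is not 
KEEP_CHECKPOINTING.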

> To detect fsimage corruption on the spot
> ----------------------------------------
>
>                 Key: HDFS-13031
>                 URL: https://issues.apache.org/jira/browse/HDFS-13031
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>         Environment:  
>            Reporter: Yongjun Zhang
>            Assignee: Adam Antal
>            Priority: Major
>
> Since we fixed HDFS-9406, new cases of similar fsimage corruption have been 
> reported from the field. We need a good fsimage + editlogs to replay in order 
> to reproduce the corruption. However, by the time the corruption is detected 
> (at a later NN restart), the good fsimage has usually been deleted already.
> We need a way to detect fsimage corruption on the spot. Currently, 
> what I think we could do is:
>  # After the SNN creates a new fsimage, it spawns a modified NN process (NN 
> with some new command line args) that just loads the fsimage and does nothing 
> else. 
>  # If the process fails, the currently running SNN will either a) back up 
> the fsimage + editlogs or b) stop checkpointing. It also needs to 
> somehow raise a flag to the user that the fsimage is corrupt.
> In step 2, if we do a, we need to introduce a new NN->JN API to back up 
> editlogs; if we do b, it changes the SNN's behavior, which is not quite 
> compatible. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

