[jira] [Updated] (HDFS-9068) SBN checkpoint could not work after the only name directory recovery from failure

He Xiaoqiao (JIRA) Sun, 13 Sep 2015 21:00:07 -0700

     [ 
https://issues.apache.org/jira/browse/HDFS-9068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


He Xiaoqiao updated HDFS-9068:
------------------------------
    Attachment: HDFS-9068.patch

Attached a patch: before saving the fsimage, check whether the failed directory is usable again.

> SBN checkpoint could not work after the only name directory recovery from 
> failure
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-9068
>                 URL: https://issues.apache.org/jira/browse/HDFS-9068
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.1
>            Reporter: He Xiaoqiao
>         Attachments: HDFS-9068.patch
>
>
> The SBN periodically checkpoints to {{dfs.namenode.name.dir}}, but the 
> checkpointer stops working when {{dfs.namenode.name.dir}} is configured with 
> only one directory and the disk holding that directory fails and later 
> recovers.
> The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage:
> {code:title=org.apache.hadoop.hdfs.server.namenode.FSImage.java|borderStyle=solid}
> @Override
> public void run() {
>   try {
>     saveFSImage(context, sd, nnf);
>   } catch (SaveNamespaceCancelledException snce) {
>     LOG.info("Cancelled image saving for " + sd.getRoot() +
>         ": " + snce.getMessage());
>     // don't report an error on the storage dir!
>   } catch (Throwable t) {
>     LOG.error("Unable to save image for " + sd.getRoot(), t);
>     context.reportErrorOnStorageDirectory(sd);
>   }
> }
> {code}
> When {{saveFSImage(context, sd, nnf)}} fails because the storage is 
> unavailable or has failed, {{context.reportErrorOnStorageDirectory(sd)}} adds 
> sd to errorSDs, and sd is never used again, even after the storage recovers. 
> The JournalNodes then accumulate a large number of editlog files because 
> checkpointing keeps failing, and the NameNode will take a long time to restart.
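
To make the failure mode above concrete, here is a minimal, self-contained sketch. It is not the Hadoop code and not the attached patch: the class StorageDirRecheckSketch, its methods, the isUsable probe, and the path /data/nn are all hypothetical stand-ins. It only models the idea that a directory moved to an error set is skipped forever unless something rechecks it before the next save.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of storage-directory bookkeeping; not Hadoop code.
class StorageDirRecheckSketch {
    private final Set<String> activeDirs = new LinkedHashSet<>();
    private final Set<String> failedDirs = new LinkedHashSet<>();
    // Hypothetical health probe: returns true if the directory can be written.
    private final Predicate<String> isUsable;

    StorageDirRecheckSketch(List<String> dirs, Predicate<String> isUsable) {
        this.activeDirs.addAll(dirs);
        this.isUsable = isUsable;
    }

    // Plays the role of reportErrorOnStorageDirectory: exclude the directory.
    void reportError(String dir) {
        if (activeDirs.remove(dir)) {
            failedDirs.add(dir);
        }
    }

    // The idea from the patch comment: re-probe failed directories and
    // return the ones that are usable again to the active set.
    void restoreRecoveredDirs() {
        Iterator<String> it = failedDirs.iterator();
        while (it.hasNext()) {
            String dir = it.next();
            if (isUsable.test(dir)) {
                it.remove();
                activeDirs.add(dir);
            }
        }
    }

    // Attempts a save on every active directory and returns the ones that
    // succeeded. With recheckFirst == false, a failed directory is never retried.
    List<String> saveImage(boolean recheckFirst) {
        if (recheckFirst) {
            restoreRecoveredDirs();
        }
        List<String> saved = new ArrayList<>();
        for (String dir : new ArrayList<>(activeDirs)) {
            if (isUsable.test(dir)) {
                saved.add(dir);
            } else {
                reportError(dir);
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        boolean[] diskUp = {false}; // simulated disk state
        StorageDirRecheckSketch s =
            new StorageDirRecheckSketch(Arrays.asList("/data/nn"), dir -> diskUp[0]);
        System.out.println(s.saveImage(false)); // disk down: nothing saved
        diskUp[0] = true;                       // disk recovers
        System.out.println(s.saveImage(false)); // still nothing: dir stays excluded
        System.out.println(s.saveImage(true));  // recheck first: save succeeds
    }
}
```

With a single configured directory, the second call shows the reported problem: the disk is healthy again, but no directory is left to save to. Rechecking before the save is one possible way to let the checkpoint proceed; the real patch may differ in where and how it performs that check.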



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
