[
https://issues.apache.org/jira/browse/HDFS-9068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
He Xiaoqiao updated HDFS-9068:
------------------------------
Attachment: HDFS-9068.patch
Attached patch: check whether the failed directory is OK again before saving the fsimage.
> SBN checkpoint could not work after the only name directory recovery from
> failure
> ---------------------------------------------------------------------------------
>
> Key: HDFS-9068
> URL: https://issues.apache.org/jira/browse/HDFS-9068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.4.1
> Reporter: He Xiaoqiao
> Attachments: HDFS-9068.patch
>
>
> The SBN checkpoints to {{dfs.namenode.name.dir}} periodically, but the
> checkpointer stops working when {{dfs.namenode.name.dir}} is configured with
> only one directory and the disk holding that directory recovers from a
> failure.
> The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage.java:
> {code:title=org.apache.hadoop.hdfs.server.namenode.FSImage.java|borderStyle=solid}
> @Override
> public void run() {
>   try {
>     saveFSImage(context, sd, nnf);
>   } catch (SaveNamespaceCancelledException snce) {
>     LOG.info("Cancelled image saving for " + sd.getRoot() +
>         ": " + snce.getMessage());
>     // don't report an error on the storage dir!
>   } catch (Throwable t) {
>     LOG.error("Unable to save image for " + sd.getRoot(), t);
>     context.reportErrorOnStorageDirectory(sd);
>   }
> }
> {code}
> When {{saveFSImage(context, sd, nnf)}} fails because the storage is
> unavailable or has failed, sd is added to errorSDs by
> {{context.reportErrorOnStorageDirectory(sd)}} and is never used again, even
> after it recovers from the failure. The JournalNode then accumulates a large
> number of editlog files because checkpointing keeps failing, and the NameNode
> takes a long time to restart.
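
For illustration only, the idea of re-checking a failed directory before saving could look roughly like the sketch below. This is not the code in HDFS-9068.patch and it does not use any Hadoop classes: StorageDirTracker, reportError and restoreRecoveredDirs are hypothetical names, and "recovered" is assumed to mean simply that the path is a writable directory again.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the NameNode's storage bookkeeping; not Hadoop code.
class StorageDirTracker {
    private final List<File> activeDirs = new ArrayList<>();
    private final List<File> failedDirs = new ArrayList<>();

    StorageDirTracker(List<File> dirs) {
        activeDirs.addAll(dirs);
    }

    // Plays the role of reportErrorOnStorageDirectory(sd): set the directory aside.
    void reportError(File dir) {
        if (activeDirs.remove(dir)) {
            failedDirs.add(dir);
        }
    }

    // Intended to be called before each image save: any failed directory that
    // is a writable directory again is moved back to the active list.
    // Assumption: "is a directory and writable" is a good enough health check.
    void restoreRecoveredDirs() {
        Iterator<File> it = failedDirs.iterator();
        while (it.hasNext()) {
            File dir = it.next();
            if (dir.isDirectory() && dir.canWrite()) {
                it.remove();
                activeDirs.add(dir);
            }
        }
    }

    // Defensive copies so callers cannot modify the internal lists.
    List<File> getActiveDirs() {
        return new ArrayList<>(activeDirs);
    }

    List<File> getFailedDirs() {
        return new ArrayList<>(failedDirs);
    }
}
```

With a single configured directory, calling restoreRecoveredDirs() before each save means that one failed save no longer leaves the checkpointer permanently without a target; a directory that is still missing or read-only simply stays in the failed list.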
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)