[
https://issues.apache.org/jira/browse/HDFS-9068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742366#comment-14742366
]
He Xiaoqiao commented on HDFS-9068:
-----------------------------------
It is necessary to add a thread that periodically checks whether each sd in
{{errorSDs}} is available again and, if so, removes it from {{errorSDs}} and
adds it back to {{storageDirs}}.
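
A minimal sketch of such a periodic check is below. It is an illustration under assumptions, not the actual NNStorage code: the class name {{StorageDirRecoveryMonitor}}, the methods {{isAvailable}} and {{recheckOnce}}, and the use of plain {{java.io.File}} in place of {{StorageDirectory}} are all hypothetical, and "available" is assumed to mean "exists as a directory and is writable".

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: periodically moves recovered dirs out of errorSDs. */
class StorageDirRecoveryMonitor {
    // Stand-ins for the storageDirs / errorSDs lists named in the comment.
    final List<File> storageDirs = new CopyOnWriteArrayList<>();
    final List<File> errorSDs = new CopyOnWriteArrayList<>();

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "storage-dir-recovery");
            t.setDaemon(true); // do not block JVM shutdown
            return t;
        });

    /** Assumed health check: the directory exists and is writable. */
    static boolean isAvailable(File dir) {
        return dir.isDirectory() && dir.canWrite();
    }

    /** One pass over errorSDs; returns the directories that were restored. */
    synchronized List<File> recheckOnce() {
        List<File> restored = new ArrayList<>();
        for (File sd : errorSDs) {
            if (isAvailable(sd)) {
                restored.add(sd);
            }
        }
        errorSDs.removeAll(restored);
        storageDirs.addAll(restored);
        return restored;
    }

    /** Starts the periodic check with a fixed delay between passes. */
    void start(long periodSeconds) {
        scheduler.scheduleWithFixedDelay(
            () -> { recheckOnce(); },
            periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

Keeping the single pass in {{recheckOnce}} separate from the scheduling makes it possible to trigger the same check right before a checkpoint as well, so a recovered directory does not have to wait for the next period.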
> SBN checkpoint could not work after the only name directory recovery from
> failure
> ---------------------------------------------------------------------------------
>
> Key: HDFS-9068
> URL: https://issues.apache.org/jira/browse/HDFS-9068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.4.1
> Reporter: He Xiaoqiao
>
> SBN checkpoints to {{dfs.namenode.name.dir}} periodically, but the
> checkpointer stops working when the configuration item
> {{dfs.namenode.name.dir}} contains only one directory and the disk on which
> that directory is located recovers from a failure.
> The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage.java
> {code:title=org.apache.hadoop.hdfs.server.namenode.FSImage.java|borderStyle=solid}
>     @Override
>     public void run() {
>       try {
>         saveFSImage(context, sd, nnf);
>       } catch (SaveNamespaceCancelledException snce) {
>         LOG.info("Cancelled image saving for " + sd.getRoot() +
>             ": " + snce.getMessage());
>         // don't report an error on the storage dir!
>       } catch (Throwable t) {
>         LOG.error("Unable to save image for " + sd.getRoot(), t);
>         context.reportErrorOnStorageDirectory(sd);
>       }
>     }
> {code}
> When {{saveFSImage(context, sd, nnf)}} fails because the storage is
> unavailable or has failed, sd is added to errorSDs by
> {{context.reportErrorOnStorageDirectory(sd)}} and is never used again, even
> after the storage recovers from the failure. The JournalNode then
> accumulates a large number of editlog files because the checkpointer keeps
> failing, and a NameNode restart takes a long time.