[ https://issues.apache.org/jira/browse/HDFS-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15991339#comment-15991339 ]
Kihwal Lee commented on HDFS-11714:
-----------------------------------
bq. What if a VERSION file already exists in the directory for some reason?
Should we at least print a WARN for further investigation?
The equivalent code for the non-HA case (saveNamespace) also unconditionally
overwrites an existing VERSION. The reasoning is that, regardless of the
previous state, the directory now has the up-to-date checkpoint, so it should
have an accompanying VERSION file. Overwriting an existing VERSION is therefore
expected. I don't think we need to do anything here.
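
To illustrate the "always overwrite" behavior described above, here is a minimal sketch. It is not the Hadoop code: the class {{VersionFileSketch}}, the method {{writeVersion}}, the temporary file name {{VERSION.tmp}} and the property used in the usage note are all hypothetical, and it only assumes that VERSION is a properties-style file inside the storage directory's {{current}} subdirectory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

// Hypothetical illustration only; this is not an HDFS class.
final class VersionFileSketch {

    // Writes VERSION into currentDir, replacing any existing file without
    // inspecting it first. The content goes to a temporary file and is then
    // moved into place, so a reader never sees a half-written VERSION.
    static void writeVersion(Path currentDir, Properties props) throws IOException {
        Files.createDirectories(currentDir);
        Path tmp = currentDir.resolve("VERSION.tmp"); // assumed temp name
        try (OutputStream out = Files.newOutputStream(tmp)) {
            props.store(out, null);
        }
        // REPLACE_EXISTING is the "unconditional overwrite": a stale VERSION
        // left over from an earlier state is simply discarded.
        Files.move(tmp, currentDir.resolve("VERSION"),
                StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Calling {{writeVersion}} right after a checkpoint image has been saved gives the same result whether the directory was empty or already held an older VERSION, which is why no extra WARN seems necessary in that path.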
bq. On the retention manager, is it the right behavior to skip purging old
image files if VERSION is missing? Should we do a follow-on fix to handle the
case where the VERSION file is lost for some other reasons (mis-operation etc.)?
At minimum, it already logs a WARN. What do you think should be done? Report a
storage error by calling {{reportErrorsOnDirectory()}}? This would put the
storage dir on the "failed" list, and it would be recovered later online.
The recovery check would then need to verify that VERSION exists.
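
A rough model of that idea, for discussion only. Nothing here is the NNStorage API: {{StorageDirsSketch}}, {{hasVersion}}, {{reportError}} and {{checkBeforePurge}} are hypothetical names, and the two lists merely stand in for the in-service and "failed" directory lists. The only assumption taken from HDFS is the {{current/VERSION}} layout of a storage directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the proposal; not Hadoop code.
final class StorageDirsSketch {
    final List<Path> active = new ArrayList<>();
    final List<Path> failed = new ArrayList<>();

    // The existence check that a recovery pass could also reuse.
    static boolean hasVersion(Path storageDir) {
        return Files.isRegularFile(storageDir.resolve("current").resolve("VERSION"));
    }

    // Stand-in for reporting a storage error: the directory leaves the
    // in-service list and is remembered as failed.
    void reportError(Path storageDir) {
        if (active.remove(storageDir)) {
            failed.add(storageDir);
        }
    }

    // Called before purging old images. Returns true if purging may proceed.
    // A directory without VERSION is reported instead of only being logged.
    boolean checkBeforePurge(Path storageDir) {
        if (hasVersion(storageDir)) {
            return true;
        }
        reportError(storageDir);
        return false;
    }
}
```

In this model a directory that lost its VERSION stops receiving new images as soon as the purge step notices it, instead of silently accumulating them.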
> Newly added NN storage directory won't get initialized and cause space
> exhaustion
> ---------------------------------------------------------------------------------
>
> Key: HDFS-11714
> URL: https://issues.apache.org/jira/browse/HDFS-11714
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.3
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Critical
> Attachments: HDFS-11714.trunk.patch, HDFS-11714.v2.branch-2.patch,
> HDFS-11714.v2.trunk.patch
>
>
> When an empty namenode storage directory is detected on normal NN startup, it
> may not be fully initialized. The new directory is still part of "in-service"
> NNStorage and when a checkpoint image is uploaded, a copy will also be written
> there. However, the retention manager won't be able to purge old files since
> it is lacking a VERSION file. This causes fsimages to pile up in the
> directory. With a large namespace, the disk will fill up in a matter of
> days or weeks.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)