[ https://issues.apache.org/jira/browse/HDFS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Colin Patrick McCabe updated HDFS-3004:
---------------------------------------
Attachment: HDFS-3004.patch
* Add 'a' option to interactive mode (always take first choice)
* More helpful printout when truncating the edit log
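
For readers without the patch at hand, here is a minimal sketch of how an "always take first choice" option might behave in an interactive prompt. It is not taken from HDFS-3004.patch: the class name RecoveryPrompt, the method ask, and the choice strings are all hypothetical, and the sketch assumes the operator types the full word for a choice, or "a" to accept the first choice for this prompt and every later one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.List;

/** Hypothetical prompt helper; NOT the class used in the actual patch. */
final class RecoveryPrompt {
    private final BufferedReader in;
    // Set once the operator answers "a"; later prompts are then skipped.
    private boolean alwaysFirst = false;

    RecoveryPrompt(BufferedReader in) {
        this.in = in;
    }

    /**
     * Asks the operator to pick one of the choices by typing it in full.
     * Typing "a" picks the first choice now and for all later prompts.
     */
    String ask(String question, List<String> choices) throws IOException {
        if (alwaysFirst) {
            return choices.get(0);
        }
        while (true) {
            System.out.println(question + " " + choices
                + " (a = always take first choice)");
            String line = in.readLine();
            if (line == null) {
                throw new IOException("input closed before a choice was made");
            }
            String answer = line.trim();
            if (answer.equalsIgnoreCase("a")) {
                alwaysFirst = true;
                return choices.get(0);
            }
            for (String choice : choices) {
                if (choice.equalsIgnoreCase(answer)) {
                    return choice;
                }
            }
            // Anything else is rejected and the question is asked again.
            System.out.println("Unrecognized answer: " + answer);
        }
    }
}
```

In this sketch the caller would list the safest or most common action first, since that is the one "a" selects from then on.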
> Create Offline NameNode recovery tool
> -------------------------------------
>
> Key: HDFS-3004
> URL: https://issues.apache.org/jira/browse/HDFS-3004
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: tools
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-3004.patch, HDFS-3004__namenode_recovery_tool.txt
>
>
> We've been talking about creating a tool which can process NameNode edit logs
> and image files offline.
> This tool would be similar to a fsck for a conventional filesystem. It would
> detect inconsistencies and malformed data. In cases where it was possible,
> and the operator asked for it, it would try to correct the inconsistency.
> It's probably better to call this "nameNodeRecovery" or similar, rather than
> "fsck," since we already have a separate and unrelated mechanism which we
> refer to as fsck.
> The use case here is that the NameNode data is corrupt for some reason, and
> we want to fix it. Obviously, we would prefer never to get into this situation. In
> a perfect world, we never would. However, bad data on disk can happen from
> time to time, because of hardware errors or misconfigurations. In the past
> we have had to correct it manually, which is time-consuming and which can
> result in downtime.
> I would like to reuse as much code as possible from the NameNode in this
> tool. Hopefully, the effort that is spent developing this will also make the
> NameNode editLog and image processing even more robust than it already is.
> Another approach that we have discussed is NOT having an offline tool, but
> just having a switch supplied to the NameNode, like "--auto-fix" or
> "--force-fix". In that case, the NameNode would attempt to "guess" when data
> was missing or incomplete in the EditLog or Image, rather than aborting as
> it does now. Like the proposed fsck tool, this switch could be used to get
> users back on their feet quickly after a problem developed. I am not in
> favor of this approach, because there is a danger that users could supply
> this flag in cases where it is not appropriate. This risk does not exist for
> an offline fsck tool, since it would have to be run explicitly. However, I
> wanted to mention this proposal here for completeness.