[ https://issues.apache.org/jira/browse/HDFS-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suresh Srinivas updated HDFS-3044:
----------------------------------
Target Version/s: 1.1.0
> fsck move should be non-destructive by default
> ----------------------------------------------
>
> Key: HDFS-3044
> URL: https://issues.apache.org/jira/browse/HDFS-3044
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: name-node
> Reporter: Eli Collins
> Assignee: Colin Patrick McCabe
> Fix For: 1.1.0, 2.0.0-alpha
>
> Attachments: HDFS-3044.002.patch, HDFS-3044.003.patch,
> HDFS-3044-b1.002.patch, HDFS-3044-b1.004.patch
>
>
> The fsck move behavior in the code and originally articulated in HADOOP-101
> is:
> {quote}Current failure modes for DFS involve blocks that are completely
> missing. The only way to "fix" them would be to recover chains of blocks and
> put them into lost+found{quote}
> A directory is created with the file name, the blocks that are accessible are
> created as individual files in this directory, then the original file is
> removed.
> I suspect the rationale for this behavior was that you can't use files that
> are missing locations, and copying the blocks as files at least makes part of
> the files accessible. However, this behavior can also result in permanent
> data loss. E.g.:
> - Some datanodes don't come up and check in on cluster startup (e.g. due to
> HW issues), so files whose blocks have all their replicas on this set of
> datanodes are marked corrupt
> - Admin does fsck move, which deletes the "corrupt" files and saves whatever
> blocks were available
> - The HW issues with the datanodes are resolved, and they are started and
> join the cluster. The NN tells them to delete their blocks for the corrupt
> files since the files were deleted.
> I think we should:
> - Make fsck move non-destructive by default (e.g. just does a move into
> lost+found)
> - Make the destructive behavior optional (e.g. "--destructive" so admins think
> about what they're doing)
> - Provide better sanity checks and warnings, e.g. if you're running fsck and
> not all the slaves have checked in (if using dfs.hosts) then fsck should
> print a warning indicating this, which an admin should have to override if
> they want to do something destructive
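
For illustration only, the three proposals above can be sketched against a toy
in-memory namespace. This is not the HDFS-3044 patch and does not use any HDFS
API: ToyNamespace, fsck_move, missing_nodes, and the force argument are all
hypothetical names invented for this sketch, and expected_nodes only loosely
stands in for the hosts listed via dfs.hosts.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNamespace:
    """Hypothetical stand-in for a namespace; not an HDFS class."""
    # path -> ordered list of block IDs
    files: dict = field(default_factory=dict)
    # block IDs that currently have at least one reachable replica
    available: set = field(default_factory=set)
    # nodes expected to check in (loosely, the dfs.hosts list)
    expected_nodes: set = field(default_factory=set)
    # nodes that have actually checked in
    live_nodes: set = field(default_factory=set)


def missing_nodes(ns):
    """Return expected nodes that have not checked in yet."""
    return ns.expected_nodes - ns.live_nodes


def fsck_move(ns, path, destructive=False, force=False):
    """Move a corrupt file under /lost+found and return the paths created.

    Default: a plain rename, so the full block list survives and late
    datanodes can still serve their replicas.
    destructive=True: salvage reachable blocks as individual files and drop
    the original; refused while nodes are missing unless force=True
    (an assumed override, not a real fsck flag).
    """
    if destructive and missing_nodes(ns) and not force:
        raise RuntimeError(
            "nodes not checked in: %s" % sorted(missing_nodes(ns)))

    blocks = ns.files.pop(path)
    target = "/lost+found" + path

    if not destructive:
        # Non-destructive: the file keeps every block reference.
        ns.files[target] = blocks
        return [target]

    created = []
    for index, block in enumerate(blocks):
        if block in ns.available:
            # One file per salvaged block, named by its position.
            chunk = "%s/%d" % (target, index)
            ns.files[chunk] = [block]
            created.append(chunk)
    # References to unavailable blocks are gone at this point, which is
    # exactly the permanent data loss described in the scenario above.
    return created
```

In this sketch the default call keeps the reference to every block, so a
datanode that joins late is not told to delete anything. Only the explicit
destructive call discards references, and only after the missing-node check
has passed or been overridden.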