[ https://issues.apache.org/jira/browse/HDFS-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon updated HDFS-1378:
------------------------------
Attachment: hdfs-1378-branch20.txt
Here's a patch for branch-20, not for commit.
In trunk the code has been refactored a bit so that the edit log loading code
directly gets a DataInputStream, so we can't do it quite the same way. I'd like
to change EditLogInputStream to just return an InputStream rather than
DataInputStream so that we can wrap it in a position tracker as done in this
patch.
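As a rough illustration of the position-tracker idea (this is not the code in the attached patch), the sketch below wraps a plain InputStream in a byte counter and remembers the offsets of the last few opcodes. The names PositionTrackingInputStream, RecentOffsets and EditReplaySketch, and the toy record format (one opcode byte followed by a 4-byte payload), are assumptions made for the example; the real edit log format is different.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Deque;

/** Counts the bytes consumed from the wrapped stream (hypothetical name). */
class PositionTrackingInputStream extends FilterInputStream {
    private long pos = 0;

    PositionTrackingInputStream(InputStream in) { super(in); }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) pos++;          // count only real bytes, not EOF
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) pos += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        pos += skipped;
        return skipped;
    }

    // mark/reset would desynchronize the counter, so advertise no support.
    @Override
    public boolean markSupported() { return false; }

    long getPos() { return pos; }
}

/** Bounded history of the most recent opcode start offsets. */
class RecentOffsets {
    private final int capacity;
    private final Deque<Long> offsets = new ArrayDeque<>();

    RecentOffsets(int capacity) { this.capacity = capacity; }

    void record(long offset) {
        if (offsets.size() == capacity) offsets.removeFirst();
        offsets.addLast(offset);
    }

    int size() { return offsets.size(); }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (long o : offsets) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(o);
        }
        return sb.toString();
    }
}

class EditReplaySketch {
    /** Replays toy records; returns an error report, or null if the log is clean. */
    static String replay(byte[] log, int keep) {
        PositionTrackingInputStream tracker =
            new PositionTrackingInputStream(new ByteArrayInputStream(log));
        // No buffering between the tracker and the reader, otherwise the
        // counter would run ahead of what has actually been parsed.
        DataInputStream in = new DataInputStream(tracker);
        RecentOffsets recent = new RecentOffsets(keep);
        try {
            while (true) {
                long start = tracker.getPos();
                int opcode = in.read();   // -1 here means a clean end of log
                if (opcode < 0) return null;
                recent.record(start);
                in.readInt();             // toy payload; EOFException if truncated
            }
        } catch (IOException e) {
            return "Error replaying edit log at offset " + tracker.getPos()
                + "; last " + recent.size() + " opcodes at offsets: " + recent;
        }
    }
}
```

For example, EditReplaySketch.replay(new byte[17], 4) hits a truncated fourth record and returns "Error replaying edit log at offset 17; last 4 opcodes at offsets: 0 5 10 15". Any buffering would have to sit underneath the tracker, not above it, for the reported offsets to match the file.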
Here's example output from an edit log that got corrupted due to the root disk
running out of space:
{noformat}
10/09/06 11:02:30 ERROR common.Storage: Error replaying edit log at offset 1698779
10/09/06 11:02:30 ERROR common.Storage: Last 4 opcodes at offsets: 1629141 1629329 1629546 1698775
10/09/06 11:02:30 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: Incorrect data format. logVersion is -18 but writables.length is 0.
{noformat}
From here it's very easy to use {{bvi}} to figure out where truncation or
corruption occurred and fix it up.
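If a binary editor isn't handy, a small read-only helper can show the bytes around a reported offset before any repair is attempted. This is only a sketch: the class OffsetDump and its methods are hypothetical names for this example, and the path passed in is whatever edits file is being inspected.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class OffsetDump {
    /** Formats bytes as hex, 16 per row, labelling each row with its file offset. */
    static String hexWindow(byte[] data, long base) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i += 16) {
            sb.append(String.format("%08x ", base + i));
            for (int j = i; j < Math.min(i + 16, data.length); j++) {
                sb.append(String.format(" %02x", data[j] & 0xff));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    /** Reads up to `radius` bytes on either side of `offset` and hex-dumps them. */
    static String dumpAround(String path, long offset, int radius) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long start = Math.max(0, offset - radius);
            long end = Math.min(f.length(), offset + radius);
            byte[] buf = new byte[(int) Math.max(0, end - start)];
            f.seek(start);
            f.readFully(buf);
            return hexWindow(buf, start);
        }
    }
}
```

Calling OffsetDump.dumpAround(path, 1698775L, 64) would print the region around the last opcode offset from the log above; the file is opened read-only, so the actual fix-up still happens in an editor.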
> Edit log replay should track and report file offsets in case of errors
> ----------------------------------------------------------------------
>
> Key: HDFS-1378
> URL: https://issues.apache.org/jira/browse/HDFS-1378
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: name-node
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hdfs-1378-branch20.txt
>
>
> Occasionally there are bugs or operational mistakes that result in corrupt
> edit logs which I end up having to repair by hand. In these cases it would be
> very handy to have the error message also print out the file offsets of the
> last several edit log opcodes so it's easier to find the right place to edit
> in the OP_INVALID marker. We could also use this facility to provide a rough
> estimate of how far along edit log replay the NN is during startup (handy
> when a 2NN has died and replay takes a while).