[
https://issues.apache.org/jira/browse/HDFS-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288038#comment-13288038
]
Colin Patrick McCabe commented on HDFS-3049:
--------------------------------------------
bq. not sure of the logic for EOF: let's say I have two streams, one is tx
1-15, and the other is 1-20. When we sort, they'll be in the order (1-20,
1-15). I then encounter an error at txid #5 in the first stream, so I switch to
the second stream. This stream will then return "null" after reading txid #15,
even though there are really 5 more txns in the group. Right?
This is the intended behavior. If the stream is NOT the last one in the edit
log, then the reader will notice a gap and throw an exception. To extend your
example, if there was a (21 - 30) stream following the (1 - 20) stream, then
the gap would be immediately apparent to the reader. This gap would prevent
normal startup.
If the stream is the last (unfinalized) one in the edit log, then we'll simply
believe that there are only 15 transactions in total. This may seem like the
wrong thing to do, but consider the following scenario:
1. NameNode writes out transactions 16-20 to the first edit log
2. NameNode dies WITHOUT acknowledging transactions 16-20 to the clients or
writing them to edit log #2
3. StandbyNameNode tries to take over
Do you want step 4 to be "StandbyNameNode crashes because the unfinalized edit
logs had different lengths" or "StandbyNameNode starts up normally"? :)
bq. I don't like using the term "automatic failover" here - because that's the
terminology we use for HA. Instead, perhaps something like "We could not find
any other edit log which contains transactions following txid %d"?
Yeah, I guess that was rather confusing. I'll change it to "edit log failover"
or something.
With regard to the state machine, I agree that an ASCII art diagram would help.
The state machine itself makes things a lot simpler because otherwise you'd
have like a dozen variables interacting in complex ways, as opposed to just
curIdx, prevTxId, and prevException.
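A rough sketch of that three-variable state machine, purely for illustration (this is not the patch itself; streams are modeled as arrays of txids, with a -1 entry standing in for a read error at that point):

```java
// Sketch of the failover reader state machine: curIdx, prevTxId,
// prevException. Not the actual HDFS-3049 patch.
public class RedundantReader {
    private final long[][] streams; // redundant streams, sorted best-first
    private final int[] pos;        // read position within each stream
    private int curIdx = 0;         // index of the stream we are reading now
    private long prevTxId = 0;      // last txid successfully returned
    private String prevException;   // why the last failover happened

    RedundantReader(long[][] streams) {
        this.streams = streams;
        this.pos = new int[streams.length];
    }

    /** Returns the next txid, failing over on error; 0 means end of log. */
    long next() {
        while (curIdx < streams.length) {
            long[] s = streams[curIdx];
            if (pos[curIdx] >= s.length) {
                return 0; // clean end of the current stream: end of the log
            }
            long txid = s[pos[curIdx]++];
            if (txid < 0) { // simulated read error: edit log failover
                prevException = "read error after txid " + prevTxId;
                curIdx++;
                continue;
            }
            if (txid <= prevTxId) {
                continue; // skip transactions already applied before failover
            }
            prevTxId = txid;
            return txid;
        }
        throw new IllegalStateException("We could not find any other edit log"
            + " containing txid " + (prevTxId + 1) + " (" + prevException + ")");
    }

    public static void main(String[] args) {
        // The example from the comment: the (1-20) stream hits corruption at
        // txid 5, so the reader fails over to the (1-15) stream and skips
        // the transactions it already applied.
        long[] shorter = new long[15];
        for (int i = 0; i < 15; i++) shorter[i] = i + 1;
        RedundantReader r = new RedundantReader(
            new long[][] { { 1, 2, 3, 4, -1 }, shorter });
        long t, last = 0;
        while ((t = r.next()) != 0) last = t;
        System.out.println("last applied txid: " + last); // 15
    }
}
```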
> During the normal loading NN startup process, fall back on a different
> EditLog if we see one that is corrupt
> ------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-3049
> URL: https://issues.apache.org/jira/browse/HDFS-3049
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: name-node
> Affects Versions: 0.23.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Minor
> Attachments: HDFS-3049.001.patch, HDFS-3049.002.patch,
> HDFS-3049.003.patch, HDFS-3049.005.against3335.patch,
> HDFS-3049.006.against3335.patch, HDFS-3049.007.against3335.patch,
> HDFS-3049.010.patch, HDFS-3049.011.patch, HDFS-3049.012.patch,
> HDFS-3049.013.patch, HDFS-3049.015.patch, HDFS-3049.017.patch,
> HDFS-3049.018.patch, HDFS-3049.021.patch, HDFS-3049.023.patch,
> HDFS-3049.025.patch, HDFS-3049.026.patch
>
>
> During the NameNode startup process, we load an image, and then apply edit
> logs to it until we believe that we have all the latest changes.
> Unfortunately, if there is an I/O error while reading any of these files, in
> most cases, we simply abort the startup process. We should try harder to
> locate a readable edit log and/or image file.
> *There are three main use cases for this feature:*
> 1. If the operating system does not honor fsync (usually due to a
> misconfiguration), a file may end up in an inconsistent state.
> 2. In certain older releases where we did not use fallocate() or similar to
> pre-reserve blocks, a disk full condition may cause a truncated log in one
> edit directory.
> 3. There may be a bug in HDFS which results in some of the data directories
> receiving corrupt data, but not all. This is the least likely use case.
> *Proposed changes to normal NN startup*
> * We should try a different FSImage if we can't load the first one we try.
> * We should examine other FSEditLogs if we can't load the first one(s) we try.
> * We should fail if we can't find EditLogs that would bring us up to what we
> believe is the latest transaction ID.
> *Proposed changes to recovery mode NN startup*
> We should list all the available storage directories and allow the
> operator to select which one to use.
> Something like this:
> {code}
> Multiple storage directories found.
> 1. /foo/bar
> edits__current__XYZ size:213421345 md5:2345345
> image size:213421345 md5:2345345
> 2. /foo/baz
> edits__current__XYZ size:213421345 md5:2345345345
> image size:213421345 md5:2345345
> Which one would you like to use? (1/2)
> {code}
> As usual in recovery mode, we want to be flexible about error handling. In
> this case, this means that we should NOT fail if we can't find EditLogs that
> would bring us up to what we believe is the latest transaction ID.
> *Not addressed by this feature*
> This feature will not address the case where an attempt to access the
> NameNode name directory or directories hangs because of an I/O error. This
> may happen, for example, when trying to load an image from a hard-mounted NFS
> directory, when the NFS server has gone away. Just as now, the operator will
> have to notice this problem and take steps to correct it.
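The proposed normal-startup changes quoted above can be sketched as a simple fallback loop. This is a hedged illustration, not the patch: the {{Image}} and {{EditLogs}} interfaces are invented stand-ins for FSImage and FSEditLog loading.

```java
// Sketch of the proposed startup fallback, not the actual patch.
import java.util.List;

public class StartupFallback {
    interface Image {
        /** Loads the image and returns its txid; throws if unreadable. */
        long load();
    }
    interface EditLogs {
        /** Replays edits after imageTxId; returns the highest txid reached. */
        long replayFrom(long imageTxId);
    }

    static long start(List<Image> images, EditLogs logs, long expectedLastTxId) {
        RuntimeException lastFailure = null;
        for (Image img : images) {            // try a different FSImage on failure
            long imageTxId;
            try {
                imageTxId = img.load();
            } catch (RuntimeException e) {
                lastFailure = e;              // fall back to the next image
                continue;
            }
            long reached = logs.replayFrom(imageTxId);
            if (reached < expectedLastTxId) { // normal startup must not drop txns
                throw new IllegalStateException("edit logs end at txid "
                    + reached + ", expected " + expectedLastTxId);
            }
            return reached;
        }
        throw new IllegalStateException("no loadable FSImage found", lastFailure);
    }
}
```

In recovery mode, per the description, the final length check would be relaxed to a warning rather than a fatal error.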