[
https://issues.apache.org/jira/browse/HDFS-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440006#comment-13440006
]
Todd Lipcon commented on HDFS-3845:
-----------------------------------
The edge cases discovered center around the treatment of a particularly nasty
failure case that was partially described in the design doc, where a minority
of nodes receive the first transaction of a new segment, and then some very
specific faults occur during a subsequent recovery of the same segment. One
example, which is covered by the new targeted test cases:
*Initial writer*
- The writer is writing to 3 JNs: JN0, JN1, and JN2.
- A log segment with txnids 1 through 100 succeeds.
- The first transaction in the next segment reaches only JN0 before the writer
crashes (e.g., it is partitioned).
*Recovery by another writer*
- The new NN starts recovery and talks to all three JNs, and thus sees that the
newest log segment needing recovery starts at txid 101.
- It sends the prepareRecovery(101) call and decides that the recovered length
of segment 101 is only the one transaction.
- It sends acceptRecovery(101-101) only to JN0 before crashing.
This yields the following state:
- JN0: 1-100 finalized, 101_inprogress, accepted recovery: 101-101
- JN1: 1-100 finalized, 101_inprogress.empty
- JN2: 1-100 finalized, 101_inprogress.empty
(the .empty files got moved aside during recovery)
If a new writer now comes along and writes to only JN1 and JN2, and crashes
during that segment, we are left with the following:
- JN0: 1-100 finalized, 101_inprogress, accepted recovery: 101-101
- JN1: 1-100 finalized, 101_inprogress with txns 101-150
- JN2: 1-100 finalized, 101_inprogress with txns 101-150
A recovery at this point, with the old protocol, would incorrectly truncate the
log to transaction ID 101.
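To make the failure concrete, here is a minimal sketch of the old decision rule,
using hypothetical class and field names rather than the actual QJM code. With a
previously accepted recovery persisted on JN0, the old rule trusts that accepted
value and truncates to 101:
{code:java}
// Hypothetical sketch of the old decision rule: a previously accepted recovery
// proposal always won, even when other JNs now hold longer segments written by
// a newer writer.
class JournalReply {
  long segmentEndTxId;            // last txid of 101_inprogress on disk
  Long acceptedRecoveryEndTxId;   // non-null if an acceptRecovery was persisted
}

class OldRecoveryDecision {
  static long decide(java.util.List<JournalReply> quorum) {
    // If any JN persisted an accepted recovery, that value wins, so JN0's
    // stale 101-101 overrides JN1/JN2's 101-150.
    for (JournalReply r : quorum) {
      if (r.acceptedRecoveryEndTxId != null) {
        return r.acceptedRecoveryEndTxId;   // returns 101 in the example above
      }
    }
    // Otherwise, recover up to the longest segment seen in the quorum.
    long best = Long.MIN_VALUE;
    for (JournalReply r : quorum) {
      best = Math.max(best, r.segmentEndTxId);
    }
    return best;
  }
}
{code}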
The solution is to modify the protocol by cribbing an idea from ZAB. In
addition to the "accepted epoch", below which no RPCs will be responded to, we
also keep a "writer epoch", which is set only when a new writer has
successfully performed recovery and starts to write to a node. In the case
above, the "writer epoch" gets incremented by the new writer when it writes
transactions 101 through 150, so the augmented state looks like:
- JN0: 1-100 finalized, 101_inprogress, accepted recovery: 101-101,
lastWriterEpoch=1
- JN1: 1-100 finalized, 101_inprogress with txns 101-150, lastWriterEpoch=3
- JN2: 1-100 finalized, 101_inprogress with txns 101-150, lastWriterEpoch=3
The recovery protocol then uses the fact that JN1 and JN2 have a higher "writer
epoch" to identify JN0's replica as stale and perform the correct recovery.
This idea was briefly mentioned in the design doc, but rejected because I
thought we could get away with a shortcut hack. But as described above, the
hack doesn't work when you have this particular sequence of faults. So, fixing
it in a more principled way makes sense at this point.
The other improvement in the recovery protocol is that, previously, the
JournalNode responded to the prepareRecovery RPC without correct information as
to the in-progress/finalized state of the segment. This was because it used the
proposed recovery SegmentInfo, instead of the state on disk, when there was a
previously accepted proposal. The new protocol changes this behavior: the
JournalNode now asserts that, if there was a previously accepted recovery, the
state on disk matches (in length) what was accepted, and then sends the on-disk
state back to the client. This allows the client to distinguish a recovery that
was already finalized from one that did not yet reach the finalization step,
allowing for better sanity checks. With the writerEpoch change above, I don't
think this change was necessary for correctness, but it is an extra layer of
assurance that we never accidentally change the length of a segment once any
replica has been finalized.
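A minimal sketch of the new JournalNode-side behavior, using hypothetical types
rather than the actual QJM protobufs:
{code:java}
// Hypothetical sketch: if an acceptRecovery was previously persisted, the
// segment on disk must match it in length, and the on-disk state (not the old
// proposal) is what gets returned to the client.
class SegmentState {
  long endTxId;
  boolean inProgress;   // false once the segment has been finalized
}

class PrepareRecoveryHandler {
  static SegmentState respond(SegmentState onDisk, SegmentState previouslyAccepted) {
    if (previouslyAccepted != null) {
      // New assertion: disk and the previously accepted recovery must agree in
      // length; a mismatch indicates a bug rather than a normal fault.
      if (onDisk == null || onDisk.endTxId != previouslyAccepted.endTxId) {
        throw new IllegalStateException(
            "On-disk segment does not match previously accepted recovery");
      }
    }
    // Returning the on-disk view lets the client tell a finalized segment from
    // an in-progress one, so it can refuse to change the length of anything
    // that has already been finalized on some replica.
    return onDisk;
  }
}
{code}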
With these improvements in place, I've been able to run several machine-weeks
of the randomized fault tests without discovering any more issues in the
protocol (I hooked it up to a Streaming job to run in parallel on a 100-node
test cluster).
> Fixes for edge cases in QJM recovery protocol
> ---------------------------------------------
>
> Key: HDFS-3845
> URL: https://issues.apache.org/jira/browse/HDFS-3845
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ha
> Affects Versions: QuorumJournalManager (HDFS-3077)
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
>
> After running the randomized tests from HDFS-3800 for several machine-weeks,
> we identified a few edge cases in the recovery protocol that weren't properly
> designed for. This JIRA is to modify the protocol slightly to address these
> edge cases.