[ https://issues.apache.org/jira/browse/HDFS-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440006#comment-13440006 ]

Todd Lipcon commented on HDFS-3845:
-----------------------------------

The edge cases discovered center around the treatment of a particularly nasty 
failure case that was partially described in the design doc, where a minority 
of nodes receive the first transaction of a new segment, and then some very 
specific faults occur during a subsequent recovery of the same segment. One 
example, which is covered by the new targeted test cases:

*Initial writer*
- The writer is writing to 3 JNs: JN0, JN1, JN2.
- A log segment with txids 1 through 100 succeeds.
- The first transaction of the next segment reaches only JN0 before the writer 
crashes (e.g. it is partitioned).
  
*Recovery by another writer*
- The new NN starts recovery and talks to all three. Thus, it sees that the 
newest log segment needing recovery is the one starting at txid 101.
- It sends the prepareRecovery(101) call, and decides that the recovery length 
for segment 101 is only the single transaction.
- It sends acceptRecovery(101-101) to only JN0 before crashing (this two-phase 
flow is sketched below).
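
To make that two-phase flow concrete, here is a minimal sketch of the 
writer-side recovery loop; the names (SegmentProposal, JournalNodeStub, 
RecoveryClient) are illustrative stand-ins, not the actual QJM classes. The 
point is only that a crash partway through phase 2 leaves a lone accepted 
proposal on JN0:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins for the recovery flow; not the actual QJM classes.
class SegmentProposal {
  final long startTxId;
  final long endTxId;
  SegmentProposal(long startTxId, long endTxId) {
    this.startTxId = startTxId;
    this.endTxId = endTxId;
  }
}

interface JournalNodeStub {
  // Phase 1: ask the JN what it has for the segment starting at segmentTxId.
  SegmentProposal prepareRecovery(long epoch, long segmentTxId);
  // Phase 2: tell the JN which segment contents to adopt.
  void acceptRecovery(long epoch, SegmentProposal decided);
}

class RecoveryClient {
  void recover(List<JournalNodeStub> journals, long epoch, long segmentTxId) {
    // Phase 1: gather responses from the JNs.
    List<SegmentProposal> responses = new ArrayList<SegmentProposal>();
    for (JournalNodeStub jn : journals) {
      responses.add(jn.prepareRecovery(epoch, segmentTxId));
    }
    SegmentProposal decided = chooseRecovery(responses);
    // Phase 2: in the scenario above, the NN crashes after this call has
    // reached only JN0, leaving a lone accepted proposal of 101-101 there.
    for (JournalNodeStub jn : journals) {
      jn.acceptRecovery(epoch, decided);
    }
  }

  private SegmentProposal chooseRecovery(List<SegmentProposal> responses) {
    // Placeholder: the decision rule is exactly what this JIRA amends.
    return responses.get(0);
  }
}
{code}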

This yields the following state:
- JN0: 1-100 finalized, 101_inprogress, accepted recovery: 101-101
- JN1: 1-100 finalized, 101_inprogress.empty
- JN2: 1-100 finalized, 101_inprogress.empty
 (the .empty files got moved aside during recovery)

If a new writer now comes along and writes to only JN1 and JN2, and crashes 
during that segment, we are left with the following:

- JN0: 1-100 finalized, 101_inprogress, accepted recovery: 101-101
- JN1: 1-100 finalized, 101_inprogress with txns 101-150
- JN2: 1-100 finalized, 101_inprogress with txns 101-150

A recovery at this point, with the old protocol, would incorrectly truncate the 
log to transaction ID 101.
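
To see why the old rule gets this wrong, here is a toy reconstruction of the 
decision at this point, assuming the old Paxos-style behavior that a previously 
accepted recovery proposal takes precedence over a plain in-progress segment. 
The classes below are illustrative only, not the actual recovery code:

{code:java}
import java.util.Arrays;
import java.util.List;

// Toy model of the prepareRecovery responses in the state listed above.
class PrepareResponse {
  final String jn;
  final long segmentEndTxId;       // highest txid in 101_inprogress on disk
  final Long acceptedRecoveryEnd;  // end txid of a previously accepted proposal, or null

  PrepareResponse(String jn, long segmentEndTxId, Long acceptedRecoveryEnd) {
    this.jn = jn;
    this.segmentEndTxId = segmentEndTxId;
    this.acceptedRecoveryEnd = acceptedRecoveryEnd;
  }
}

public class OldRecoveryRule {
  // Old rule (approximately): a previously accepted proposal wins outright.
  static long decideEndTxId(List<PrepareResponse> responses) {
    for (PrepareResponse r : responses) {
      if (r.acceptedRecoveryEnd != null) {
        return r.acceptedRecoveryEnd;  // JN0's stale 101-101 proposal wins
      }
    }
    long best = Long.MIN_VALUE;
    for (PrepareResponse r : responses) {
      best = Math.max(best, r.segmentEndTxId);
    }
    return best;
  }

  public static void main(String[] args) {
    List<PrepareResponse> responses = Arrays.asList(
        new PrepareResponse("JN0", 101, Long.valueOf(101)),
        new PrepareResponse("JN1", 150, null),
        new PrepareResponse("JN2", 150, null));
    // Prints 101: transactions 102-150 on JN1/JN2 would be discarded.
    System.out.println("old protocol recovers through txid "
        + decideEndTxId(responses));
  }
}
{code}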


The solution is to modify the protocol by cribbing an idea from ZAB. In 
addition to the "accepted epoch", below which a JournalNode will not respond to 
RPCs, we also keep a "writer epoch", which is set only when a new writer has 
successfully performed recovery and starts to write to a node. In the case 
above, the "writer epoch" gets incremented by the new writer writing 
transactions 101 through 150, so that the augmented state looks like:


- JN0: 1-100 finalized, 101_inprogress, accepted recovery: 101-101, 
lastWriterEpoch=1
- JN1: 1-100 finalized, 101_inprogress with txns 101-150, lastWriterEpoch=3
- JN2: 1-100 finalized, 101_inprogress with txns 101-150, lastWriterEpoch=3

The recovery protocol then uses the fact that JN1 and JN2 have a higher "writer 
epoch" to distinguish JN0's replica as stale, and perform the correct recovery.

This idea was briefly mentioned in the design doc, but rejected because I 
thought we could get away with a shortcut hack. But as described above, the 
hack doesn't work when you have this particular sequence of faults. So, fixing 
it in a more principled way makes sense at this point.

The other improvement in the recovery protocol is that, previously, the 
JournalNode responded to the prepareRecovery RPC without correct information 
about the in-progress/finalized state of the segment. This was because it used 
the proposed recovery SegmentInfo, instead of the state on disk, when there was 
a previously accepted proposal. The new protocol changes this behavior to 
assert that, if there was a previously accepted recovery, the state on disk 
matches (in length) what was accepted, and then to send that on-disk state back 
to the client. This allows the client to distinguish a recovery that was 
already finalized from one that did not yet reach the finalization step, 
allowing for better sanity checks. With the writerEpoch change above, I don't 
think this change was necessary for correctness, but it is an extra layer of 
assurance that we don't accidentally change the length of a segment if any 
replica has been finalized.
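
A sketch of that JournalNode-side behavior, with hypothetical names (the real 
response types such as the SegmentInfo above are elided): if there is a 
previously accepted recovery, the node checks that the on-disk segment matches 
it in length, and then reports the on-disk state, including whether it has been 
finalized.

{code:java}
// Hypothetical sketch of the JournalNode side of prepareRecovery.
class OnDiskSegment {
  final long startTxId;
  final long endTxId;
  final boolean finalized;  // true once the segment has been finalized

  OnDiskSegment(long startTxId, long endTxId, boolean finalized) {
    this.startTxId = startTxId;
    this.endTxId = endTxId;
    this.finalized = finalized;
  }
}

public class JournalNodeRecovery {
  private Long acceptedRecoveryEndTxId;  // from a previous acceptRecovery, or null

  // Returns what the caller should see for the segment under recovery.
  OnDiskSegment prepareRecovery(OnDiskSegment onDisk) {
    if (acceptedRecoveryEndTxId != null
        && onDisk.endTxId != acceptedRecoveryEndTxId) {
      // New behavior: the on-disk segment must match the accepted proposal
      // in length; anything else indicates a bug or corruption.
      throw new IllegalStateException("On-disk segment ends at "
          + onDisk.endTxId + " but accepted recovery ends at "
          + acceptedRecoveryEndTxId);
    }
    // Report the on-disk state (not the proposal), so the caller can tell a
    // finalized recovery apart from one that never reached finalization.
    return onDisk;
  }

  public static void main(String[] args) {
    JournalNodeRecovery jn0 = new JournalNodeRecovery();
    jn0.acceptedRecoveryEndTxId = Long.valueOf(101);
    OnDiskSegment reported =
        jn0.prepareRecovery(new OnDiskSegment(101, 101, false));
    System.out.println("reported endTxId=" + reported.endTxId
        + " finalized=" + reported.finalized);
  }
}
{code}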

With these improvements in place, I've been able to run several machine-weeks 
of the randomized fault tests without discovering any more issues in the 
protocol (I hooked it up to a Streaming job to run in parallel on a 100-node 
test cluster).
                
> Fixes for edge cases in QJM recovery protocol
> ---------------------------------------------
>
>                 Key: HDFS-3845
>                 URL: https://issues.apache.org/jira/browse/HDFS-3845
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> After running the randomized tests from HDFS-3800 for several machine-weeks, 
> we identified a few edge cases in the recovery protocol that weren't properly 
> designed for. This JIRA is to modify the protocol slightly to address these 
> edge cases.
