[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465164#comment-13465164
]
Todd Lipcon commented on HDFS-3077:
-----------------------------------
bq. "Henceforth we will refer to these nodes as replicas." Please use a
different term as replicas is heavily used in the context of block replica in
HDFS. Perhaps Journal Replicas may be a better name.
Fixed
bq. "Before taking action in response to any RPC, the JournalNode checks the
requester's epoch number against its lastPromisedEpoch variable. If the
requester's epoch is lower, then it will reject the request". This is true only
for the RPCs other than newEpoch. Further, it should say that if the requester's
epoch is not equal to lastPromisedEpoch, the request is rejected.
Fixed. Actually, if any request comes with a higher epoch than
lastPromisedEpoch, then the JN accepts the request and also updates
lastPromisedEpoch. This allows a JournalNode to join back into the quorum
properly even if it was down when the writer became active.
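To make that concrete, here's a rough sketch in Java of the per-RPC check on
the JN side (class, method, and helper names are illustrative, not the actual
patch code):
{code}
import java.io.IOException;

/** Sketch (not the HDFS-3077 patch) of the per-request epoch check. */
class EpochChecker {
  private long lastPromisedEpoch;

  synchronized void checkRequest(long requesterEpoch) throws IOException {
    if (requesterEpoch < lastPromisedEpoch) {
      // Stale writer: reject the request outright.
      throw new IOException("epoch " + requesterEpoch
          + " < last promised epoch " + lastPromisedEpoch);
    }
    if (requesterEpoch > lastPromisedEpoch) {
      // A newer writer was elected while this JN was down or partitioned;
      // adopt the higher epoch so the JN can rejoin the quorum.
      lastPromisedEpoch = requesterEpoch;
      persistEpoch(lastPromisedEpoch);
    }
    // requesterEpoch == lastPromisedEpoch: accept and proceed.
  }

  private void persistEpoch(long epoch) {
    // Hypothetical stand-in for durably recording the promise on disk.
  }
}
{code}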
bq. In step 3, you mean newEpoch is sent to "JNs" and not QJMs. Rest of the
description should also read "JNs" instead of "QJMs".
Thanks, fixed.
bq. In step 4. "Otherwise, it aborts the attempt to become the active writer."
What is the state of QJM after this at the namenode? More details needed.
Clarified:
{code}
Otherwise, it aborts the attempt to become the active writer by throwing an
IOException. This will be handled by the NameNode in the same fashion as a
failure to write to an NFS mount -- if the QJM is being used as a shared edits
volume, it will cause the NN to abort.
{code}
bq. Section 2.6, bullet 3 - is synchronization on quorum nodes done only for
the last segment or for all the segments (required for a given fsimage)? Based
on the answer, section 2.8 might require updates.
It only synchronizes the last log segment. Any earlier segments are already
guaranteed to be finalized on a quorum of nodes (either by a postcondition of
the recovery process, or by the fact that a new segment is not started by a
writer until the previous one is finalized on a quorum).
In the future, we might have a background thread synchronizing earlier log
segments to "minority JNs" who were down when they were written, but we already
have a guarantee that a quorum has every segment.
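The invariant is easy to state as a sketch of the writer's roll logic (stubbed
helpers, not the patch itself):
{code}
import java.io.IOException;

/**
 * Sketch of the roll invariant: segment N+1 is never started until the
 * segment ending at N is finalized on a quorum. Helpers are stubbed.
 */
class RollInvariant {
  void rollLogSegment(long lastTxId) throws IOException {
    int acks = finalizeOnAllJns(lastTxId);   // # of JNs that finalized
    if (acks < quorumSize()) {
      throw new IOException("could not finalize segment on a quorum");
    }
    // Safe: every segment before this point is finalized on a quorum,
    // so recovery only ever needs to synchronize the last segment.
    startLogSegmentOnAllJns(lastTxId + 1);
  }

  private int quorumSize() { return 2; }                  // e.g. 2 of 3 JNs
  private int finalizeOnAllJns(long txid) { return 3; }   // stub
  private void startLogSegmentOnAllJns(long txid) { }     // stub
}
{code}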
bq. Say a new JN is added or an older JN came back up during restart of the
cluster. I think you may achieve quorum without the overlap of a node that was
part of the previous quorum write. This could result in loading a stale
journal. How do we handle this? Is the set of JNs that the system was
configured/working with tracked?
The JNs don't auto-format themselves, so if you bring up a new one with no
data, or otherwise end up contacting one that wasn't part of the previous
quorum, then it won't be able to respond to the newEpoch() call. It will throw
a "JournalNotFormattedException".
As for adding new journals, the process today would be:
a) shut down HDFS cleanly
b) rsync one of the JN directories to the new nodes
c) add new nodes to the qjournal URI
d) restart HDFS
As I understand it, this is how new nodes are added to ZooKeeper quorums as
well. In the future we might add a feature to help admins with this, but it's a
really rare circumstance, so I think it's better to eschew the complexity in
the initial release. (ZK itself is only adding online quorum reconfiguration
now)
The JNs also keep track of the namespace ID and will reject requests from a
writer if his nsid doesn't match, which prevents accidental "mixing" of nodes
between clusters.
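Roughly, the guards on newEpoch() look like this (a sketch;
JournalNotFormattedException is the real exception, the rest of the names are
illustrative):
{code}
import java.io.IOException;

/** Illustrative guards a JournalNode applies inside newEpoch(). */
class NewEpochGuards {
  private boolean formatted;  // set once the JN's storage is formatted
  private int namespaceId;    // recorded at format time

  void checkNewEpochRequest(int requestNsId) throws IOException {
    if (!formatted) {
      // A fresh JN with no data cannot vote in a quorum; the real code
      // signals this with JournalNotFormattedException.
      throw new IOException("Journal storage is not formatted");
    }
    if (requestNsId != namespaceId) {
      // Prevents accidental "mixing" of nodes between clusters.
      throw new IOException("Namespace ID mismatch: stored " + namespaceId
          + " vs. request " + requestNsId);
    }
  }
}
{code}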
bq. What is the effect of newEpoch from another writer on a JournalNode that is
performing recovery, especially when it is performing AcceptRecovery? It would
be good to cover what happens in other states as well.
Since all of the RPC calls are synchronized, there are no race conditions
_during_ the RPC. If a new writer performs newEpoch before acceptRecovery, then
the acceptRecovery call will fail. If the new writer performs newEpoch after
acceptRecovery, then the new one will get word of the previous writer's
recovery proposal when it calls prepareRecovery().
This part follows Paxos pretty closely, and I didn't want to digress too much
into explaining Paxos in the design doc. I'm happy to add an Appendix with a
couple of these examples, though, if you think that would be useful.
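In lieu of a full appendix, here's a tiny sketch of why the synchronization
makes the ordering safe (illustrative names, not the patch):
{code}
import java.io.IOException;

/** Sketch of why newEpoch() and acceptRecovery() cannot race. */
class RecoverySketch {
  private long lastPromisedEpoch;
  private long acceptedInEpoch;
  private long acceptedEndTxId;

  // Both RPCs are synchronized on the same monitor, so they serialize.
  synchronized void newEpoch(long epoch) throws IOException {
    if (epoch <= lastPromisedEpoch) {
      throw new IOException("epoch " + epoch + " already promised");
    }
    lastPromisedEpoch = epoch;
  }

  synchronized void acceptRecovery(long epoch, long endTxId)
      throws IOException {
    if (epoch < lastPromisedEpoch) {
      // A newer writer ran newEpoch() first, so this recovery is stale.
      throw new IOException("stale recovery from epoch " + epoch);
    }
    // Record the proposal; prepareRecovery() reports it back to the
    // next writer, exactly as an acceptor does in Paxos.
    acceptedInEpoch = epoch;
    acceptedEndTxId = endTxId;
  }
}
{code}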
bq. In "Prepare Recovery RPC", how does writer use previously accepted recovery
proposal?
Per above, this follows Paxos. If there are previously accepted proposals, then
the new writer chooses them preferentially even if there are other segments
which might be longer -- see section 3.9 point 3.a.
bq. Does accept recovery wait till journal segments are downloaded? How does
the timeout work for this?
Yep, it downloads the new segment, then atomically renames the segment from its
temporary location and records the accepted recovery. The timeout here is the
same as "dfs.image.transfer.timeout" (default 60sec). If it times out, then it
will throw an exception and not accept the recovery. If the writer performing
recovery doesn't succeed on a majority of nodes, then it will fail at this step.
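Sketched out, the accept step looks something like this (helper names are
mine; the real download is bounded by dfs.image.transfer.timeout):
{code}
import java.io.File;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

/** Sketch of the segment-synchronization part of acceptRecovery(). */
class SegmentSync {
  void acceptRecovery(File tmpSegment, File finalSegment) throws IOException {
    // 1. The segment has been downloaded over HTTP into tmpSegment,
    //    bounded by the same timeout as dfs.image.transfer.timeout
    //    (default 60s). On timeout, an exception propagates and the
    //    recovery is NOT accepted.

    // 2. Make sure the bytes are durably on disk before the rename.
    try (FileChannel ch = FileChannel.open(tmpSegment.toPath(),
        StandardOpenOption.WRITE)) {
      ch.force(true);
    }

    // 3. Atomically move the synchronized segment into place, replacing
    //    any previous copy, then persist the accepted-recovery record.
    if (!tmpSegment.renameTo(finalSegment)) {
      throw new IOException("could not rename " + tmpSegment);
    }
  }
}
{code}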
bq. Section 2.9 - "For each logger, calculate maxSeenEpoch as the greater of
that logger's lastWriterEpoch and the epoch number corresponding to any
previously accepted recovery proposal." Can you explain in section 2.10.6 why
a previously accepted recovery proposal needs to be considered?
This is necessary in case a writer fails in the middle of recovery. Here's an
example, which I'll also add to the design doc:
Assume we have failed with the three JNs at different lengths, as in Example
2.10.2:
|| JN || segment || last txid || acceptedInEpoch || lastWriterEpoch ||
| JN1 | edits_inprogress_101 | 150 | - | 1 |
| JN2 | edits_inprogress_101 | 153 | - | 1 |
| JN3 | edits_inprogress_101 | 125 | - | 1 |
Now assume that the first recovery attempt only contacts JN1 and JN3. It
decides that length 150 is the correct recovery length, and calls
{{acceptRecovery(150)}} on JN1 and JN3, followed by
{{finalizeLogSegment(101-150)}}. But, it crashes before the
{{finalizeLogSegment}} call reaches JN1.
The state now is:
|| JN || segment || last txid || acceptedInEpoch || lastWriterEpoch ||
| JN1 | edits_inprogress_101 | 150 | 2 | 1 |
| JN2 | edits_inprogress_101 | 153 | - | 1 |
| JN3 | edits_101-150 | 150 | - | 1 |
When a new NN now begins recovery, assume it talks only to JN1 and JN2. If it
did not consider {{acceptedInEpoch}}, it would incorrectly decide to finalize
to txid 153, which would break the invariant that finalized log segments
beginning at the same transaction ID must have the same length. Because of
Rule 3.b, it will instead choose JN1 again as the recovery master, and
properly finalize JN1 and JN2 to txid 150 instead of 153, which matches the
segment already finalized on the currently unavailable JN3.
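In code form, the selection rule over the prepareRecovery() responses works
out to roughly the following sketch, which reproduces the example above
(class and field names are illustrative):
{code}
import java.util.Comparator;
import java.util.List;

/** Sketch (not the patch) of the recovery-source selection rule. */
class RecoveryChooser {
  static class PrepareResponse {
    final String jn;
    final long lastWriterEpoch, acceptedInEpoch, lastTxId;
    PrepareResponse(String jn, long lwe, long aie, long txid) {
      this.jn = jn;
      this.lastWriterEpoch = lwe;
      this.acceptedInEpoch = aie;
      this.lastTxId = txid;
    }
    long maxSeenEpoch() { return Math.max(lastWriterEpoch, acceptedInEpoch); }
  }

  // Prefer the highest maxSeenEpoch (Rule 3.b), then the longest log.
  static final Comparator<PrepareResponse> BEST =
      Comparator.comparingLong(PrepareResponse::maxSeenEpoch)
                .thenComparingLong(r -> r.lastTxId);

  public static void main(String[] args) {
    // The second recovery from the table above: only JN1 and JN2 answer.
    // JN1's accepted epoch-2 proposal beats JN2's longer, unrecovered log.
    List<PrepareResponse> responses = List.of(
        new PrepareResponse("JN1", 1, 2, 150),
        new PrepareResponse("JN2", 1, 0, 153));
    PrepareResponse winner = responses.stream().max(BEST).get();
    // Prints: JN1 wins; finalize to txid 150
    System.out.println(winner.jn + " wins; finalize to txid "
        + winner.lastTxId);
  }
}
{code}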
bq. Section 3 - since a reader can read from any JN, if the JN it is reading
from gets disconnected from active, does the reader know about it? How does
this work especially in the context of standby namenode?
Though the SBN can read each segment from any one of the JNs, it actually sends
"getEditLogManifest" to _all_ of the JNs. Then it takes the results and merges
them using RedundantEditLogInputStream. So, if two JNs that have a certain
segment are up, both are available for reading. If one of them crashes in the
middle of the read, the SBN can "fail over" to reading from the other JN that
has the same segment.
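RedundantEditLogInputStream is the real class; its failover behavior is
roughly the following sketch (the real stream also re-synchronizes to the next
needed txid after a failover):
{code}
import java.io.IOException;
import java.util.List;

/**
 * Simplified model of reading one segment that several JNs advertise:
 * try the current stream, and on failure fail over to the next one.
 */
class RedundantStreamSketch {
  interface EditStream {
    long readOp() throws IOException; // next txid, or -1 at end of segment
  }

  long readNextOp(List<EditStream> sameSegmentStreams) throws IOException {
    IOException lastError = null;
    for (EditStream stream : sameSegmentStreams) {
      try {
        return stream.readOp();
      } catch (IOException e) {
        // The JN behind this stream crashed mid-read; fail over to the
        // next JN whose manifest listed the same segment.
        lastError = e;
      }
    }
    throw new IOException("no remaining replicas of this segment", lastError);
  }
}
{code}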
bq. The following additional things would be good to cover in the design:
bq. Cover bootstrapping of the JournalNode and how it is formatted
Added a section to the design doc on bootstrapping and formatting.
bq. Section 2.8 "replacing any current copy of the log segment". Need more
details here. Is it possible that we delete a segment and due to correlated
failures, we lose the journal data in the process. So replacing must perhaps
keep the old log segment until the segment recovery completes.
Can you give a specific example of what you mean here? We don't delete the
existing segment except when we are moving a new one on top of it -- and the
new one has already been determined to be a "valid recovery". The download
process via HTTP also uses FileChannel.force() after downloading to be sure
that the new file is fully on disk before it is moved into place.
bq. How is addition, deletion and JN becoming live again from the previous
state of dead/very slow handled?
On each segment roll, the client will again retry writing to all of the JNs,
even those that had been marked "out of sync" during the previous log segment.
If it's just lagging a bit, then the queueing in the IPCLoggerChannel handles
that (it'll start the new segment a bit behind the other nodes, but that's
fine). Is there a specific example I can explain that would make this clearer?
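Here's a rough model of one writer-to-JN channel to illustrate (names like
outOfSync are illustrative):
{code}
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Rough model of one writer-to-JN channel: edits queue per JN, a JN
 * marked out-of-sync is skipped for the rest of the segment, and each
 * startLogSegment() clears the flag so the JN gets another chance.
 */
class LoggerChannelSketch {
  private final Queue<byte[]> pendingEdits = new ArrayDeque<>();
  private boolean outOfSync;

  synchronized void sendEdits(byte[] batch) {
    if (outOfSync) {
      return; // skip this JN until the next segment roll
    }
    pendingEdits.add(batch); // drained to the JN by a background sender
  }

  synchronized void markOutOfSync() {
    outOfSync = true;       // e.g. the JN timed out or fell too far behind
    pendingEdits.clear();
  }

  synchronized void startLogSegment(long firstTxId) {
    // Retried on every roll, even for previously failed JNs. A JN that
    // was merely lagging starts the new segment a bit behind the others.
    outOfSync = false;
  }
}
{code}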
bq. I am still concerned (see my previous comments about epochs using JNs) that
a NN that does not hold the ZK lock can still cause service interruption. This
could be considered later as an enhancement. This probably is a bigger
discussion.
Yeah, I agree this is worth a separate discussion. There's no real way to tie a
ZK lock to anything except ZK data - you can always think you have the lock,
but by the time you take action, no longer hold it.
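For the record, the race is the classic check-then-act problem (hypothetical
helpers, nothing HDFS-specific):
{code}
/** Classic check-then-act race with an external lock; hypothetical stubs. */
class ZkLockRace {
  void unsafeWrite() {
    if (zkLockIsMine()) {
      // The ZK session can expire right here, letting another NN take
      // the lock before the next line runs. The epoch numbers on the
      // JNs are what close exactly this window.
      writeToSharedStorage();
    }
  }

  private boolean zkLockIsMine() { return true; }  // hypothetical stub
  private void writeToSharedStorage() { }          // hypothetical stub
}
{code}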
bq. I saw couple of white space/empty line changes
Will take care of these, sorry.
bq. Also moving some of the documentation around can be done in trunk, or that
particular change can be merged to trunk to keep this patch smaller.
It seems wrong to merge the docs change to trunk when the code it's documenting
isn't there yet. Aaron posted some helpful diffs with the docs on HDFS-3926 if
you want to review them without all the extra diff noise caused by the move.
bq. An additional comment - in the 3092 design, recovery had just fence
(newEpoch() here) and roll. I am not sure why recovery needs to have so many
steps - prepare, accept and roll. Can you please describe what I am missing?
I think some of the above comments may explain this - in particular the reason
why you need the idea of accepting recovery prior to committing it. Otherwise,
I'll turn the question on its head: why do you think you can get away with so
few steps? Perhaps it's possible in a system that requires every write to go
to all of the nodes rather than just a quorum of them.
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Fix For: QuorumJournalManager (HDFS-3077)
>
> Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt,
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf,
> qjournal-design.pdf, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.