[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465164#comment-13465164
]
Todd Lipcon commented on HDFS-3077:
-----------------------------------
bq. "Henceforth we will refer to these nodes as replicas." Please use a
different term as replicas is heavily used in the context of block replica in
HDFS. Perhaps Journal Replicas may be a better name.
Fixed
bq. "Before taking action in response to any RPC, the JournalNode checks the
requester's epoch number against its lastPromisedEpoch variable. If the
requester's epoch is lower, then it will reject the request". This is true only
for the RPCs other than newEpoch. Further, it should say that if the requester's
epoch is not equal to lastPromisedEpoch, the request is rejected.
Fixed. Actually, if any request comes with a higher epoch than
lastPromisedEpoch, then the JN accepts the request and also updates
lastPromisedEpoch. This allows a JournalNode to join back into the quorum
properly even if it was down when the writer became active.
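To make that concrete, here's a rough sketch in Java of the per-RPC check on
the JN side (class, method, and helper names are illustrative, not the actual
patch code):
{code}
import java.io.IOException;

/** Sketch (not the HDFS-3077 patch) of the per-request epoch check. */
class EpochChecker {
  private long lastPromisedEpoch;

  synchronized void checkRequest(long requesterEpoch) throws IOException {
    if (requesterEpoch < lastPromisedEpoch) {
      // Stale writer: reject the request outright.
      throw new IOException("epoch " + requesterEpoch
          + " < last promised epoch " + lastPromisedEpoch);
    }
    if (requesterEpoch > lastPromisedEpoch) {
      // A newer writer was elected while this JN was down or partitioned;
      // adopt the higher epoch so the JN can rejoin the quorum.
      lastPromisedEpoch = requesterEpoch;
      persistEpoch(lastPromisedEpoch);
    }
    // requesterEpoch == lastPromisedEpoch: accept and proceed.
  }

  private void persistEpoch(long epoch) {
    // Hypothetical stand-in for durably recording the promise on disk.
  }
}
{code}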
bq. In step 3, you mean newEpoch is sent to "JNs" and not QJMs. Rest of the
description should also read "JNs" instead of "QJMs".
Thanks, fixed.
bq. In step 4. "Otherwise, it aborts the attempt to become the active writer."
What is the state of QJM after this at the namenode? More details needed.
Clarified:
{code}
Otherwise, it aborts the attempt to become the active writer by throwing an
IOException. This will be handled by the NameNode in the same fashion as a
failure to write to an NFS mount -- if the QJM is being used as a shared edits
volume, it will cause the NN to abort.
{code}
bq. Section 2.6, bullet 3 - is synchronization on quorum nodes done only for
the last segment or for all the segments (required for a given fsimage)? Based
on the answer, section 2.8 might require updates.
It only synchronizes the last log segment. Any earlier segments are already
guaranteed to be finalized on a quorum of nodes (either by a postcondition of
the recovery process, or by the fact that a new segment is not started by a
writer until the previous one is finalized on a quorum).
In the future, we might have a background thread synchronizing earlier log
segments to "minority JNs" who were down when they were written, but we already
have a guarantee that a quorum has every segment.
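The invariant is easy to state as a sketch of the writer's roll logic (stubbed
helpers, not the patch itself):
{code}
import java.io.IOException;

/**
 * Sketch of the roll invariant: segment N+1 is never started until the
 * segment ending at N is finalized on a quorum. Helpers are stubbed.
 */
class RollInvariant {
  void rollLogSegment(long lastTxId) throws IOException {
    int acks = finalizeOnAllJns(lastTxId);   // # of JNs that finalized
    if (acks < quorumSize()) {
      throw new IOException("could not finalize segment on a quorum");
    }
    // Safe: every segment before this point is finalized on a quorum,
    // so recovery only ever needs to synchronize the last segment.
    startLogSegmentOnAllJns(lastTxId + 1);
  }

  private int quorumSize() { return 2; }                  // e.g. 2 of 3 JNs
  private int finalizeOnAllJns(long txid) { return 3; }   // stub
  private void startLogSegmentOnAllJns(long txid) { }     // stub
}
{code}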
bq. Say a new JN is added or an older JN came back up during restart of the
cluster. I think you may achieve quorum without the overlap of a node that was
part of the previous quorum write. This could result in loading a stale
journal. How do we handle this? Is the set of JNs that the system was
configured/working with tracked?
The JNs don't auto-format themselves, so if you bring up a new one with no
data, or otherwise end up contacting one that wasn't part of the previous
quorum, then it won't be able to respond to the newEpoch() call. It will throw
a "JournalNotFormattedException".
As for adding new journals, the process today would be:
a) shut down HDFS cleanly
b) rsync one of the JN directories to the new nodes
c) add new nodes to the qjournal URI
d) restart HDFS
As I understand it, this is how new nodes are added to ZooKeeper quorums as
well. In the future we might add a feature to help admins with this, but it's a
really rare circumstance, so I think it's better to eschew the complexity in
the initial release. (ZK itself is only adding online quorum reconfiguration
now)
The JNs also keep track of the namespace ID and will reject requests from a
writer if his nsid doesn't match, which prevents accidental "mixing" of nodes
between clusters.
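Roughly, the guards on newEpoch() look like this (a sketch;
JournalNotFormattedException is the real exception, the rest of the names are
illustrative):
{code}
import java.io.IOException;

/** Illustrative guards a JournalNode applies inside newEpoch(). */
class NewEpochGuards {
  private boolean formatted;  // set once the JN's storage is formatted
  private int namespaceId;    // recorded at format time

  void checkNewEpochRequest(int requestNsId) throws IOException {
    if (!formatted) {
      // A fresh JN with no data cannot vote in a quorum; the real code
      // signals this with JournalNotFormattedException.
      throw new IOException("Journal storage is not formatted");
    }
    if (requestNsId != namespaceId) {
      // Prevents accidental "mixing" of nodes between clusters.
      throw new IOException("Namespace ID mismatch: stored " + namespaceId
          + " vs. request " + requestNsId);
    }
  }
}
{code}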
bq. What is the effect of newEpoch from another writer on a JournalNode that is
performing recovery, especially when it is performing AcceptRecovery? It would
be good to cover what happens in other states as well.
Since all of the RPC calls are synchronized, there are no race conditions
_during_ the RPC. If a new writer performs newEpoch before acceptRecovery, then
the acceptRecovery call will fail. If the new writer performs newEpoch after
acceptRecovery, then the new one will get word of the previous writer's
recovery proposal when it calls prepareRecovery().
This part follows Paxos pretty closely, and I didn't want to digress too much
into explaining Paxos in the design doc. I'm happy to add an Appendix with a
couple of these examples, though, if you think that would be useful.
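In lieu of a full appendix, here's a tiny sketch of why the synchronization
makes the ordering safe (illustrative names, not the patch):
{code}
import java.io.IOException;

/** Sketch of why newEpoch() and acceptRecovery() cannot race. */
class RecoverySketch {
  private long lastPromisedEpoch;
  private long acceptedInEpoch;
  private long acceptedEndTxId;

  // Both RPCs are synchronized on the same monitor, so they serialize.
  synchronized void newEpoch(long epoch) throws IOException {
    if (epoch <= lastPromisedEpoch) {
      throw new IOException("epoch " + epoch + " already promised");
    }
    lastPromisedEpoch = epoch;
  }

  synchronized void acceptRecovery(long epoch, long endTxId)
      throws IOException {
    if (epoch < lastPromisedEpoch) {
      // A newer writer ran newEpoch() first, so this recovery is stale.
      throw new IOException("stale recovery from epoch " + epoch);
    }
    // Record the proposal; prepareRecovery() reports it back to the
    // next writer, exactly as an acceptor does in Paxos.
    acceptedInEpoch = epoch;
    acceptedEndTxId = endTxId;
  }
}
{code}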
bq. In "Prepare Recovery RPC", how does writer use previously accepted recovery
proposal?
Per above, this follows Paxos. If there are previously accepted proposals, then
the new writer chooses them preferentially even if there are other segments
which might be longer -- see section 3.9 point 3.a.
bq. Does accept recovery wait till journal segments are downloaded? How does
the timeout work for this?
Yep, it downloads the new segment, then atomically renames the segment from its
temporary location and records the accepted recovery. The timeout here is the
same as "dfs.image.transfer.timeout" (default 60sec). If it times out, then it
will throw an exception and not accept the recovery. If the writer performing
recovery doesn't succeed on a majority of nodes, then it will fail at this step.
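Sketched out, the accept step looks something like this (helper names are
mine; the real download is bounded by dfs.image.transfer.timeout):
{code}
import java.io.File;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

/** Sketch of the segment-synchronization part of acceptRecovery(). */
class SegmentSync {
  void acceptRecovery(File tmpSegment, File finalSegment) throws IOException {
    // 1. The segment has been downloaded over HTTP into tmpSegment,
    //    bounded by the same timeout as dfs.image.transfer.timeout
    //    (default 60s). On timeout, an exception propagates and the
    //    recovery is NOT accepted.

    // 2. Make sure the bytes are durably on disk before the rename.
    try (FileChannel ch = FileChannel.open(tmpSegment.toPath(),
        StandardOpenOption.WRITE)) {
      ch.force(true);
    }

    // 3. Atomically move the synchronized segment into place, replacing
    //    any previous copy, then persist the accepted-recovery record.
    if (!tmpSegment.renameTo(finalSegment)) {
      throw new IOException("could not rename " + tmpSegment);
    }
  }
}
{code}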
bq. Section 2.9 - "For each logger, calculate maxSeenEpoch as the greater of
that logger's lastWriterEpoch and the epoch number corresponding to any
previously accepted recovery proposal." Can you explain in section 2.10.6 why
a previously accepted recovery proposal needs to be considered?
This is necessary in case a writer fails in the middle of recovery. Here's an
example, which I'll also add to the design doc:
Assume we have failed with the three JNs at different lengths, as in Example
2.10.2:
|| JN || segment || last txid || acceptedInEpoch || lastWriterEpoch ||
| JN1 | edits_inprogress_101 | 150 | - | 1 |
| JN2 | edits_inprogress_101 | 153 | - | 1 |
| JN3 | edits_inprogress_101 | 125 | - | 1 |
Now assume that the first recovery attempt only contacts JN1 and JN3. It
decides that length 150 is the correct recovery length, and calls
{{acceptRecovery(150)}} on JN1 and JN3, followed by
{{finalizeLogSegment(101-150)}}. But, it crashes before the
{{finalizeLogSegment}} call reaches JN1.
The state now is:
|| JN || segment || last txid || acceptedInEpoch || lastWriterEpoch ||
| JN1 | edits_inprogress_101 | 150 | 2 | 1 |
| JN2 | edits_inprogress_101 | 153 | - | 1 |
| JN3 | edits_101-150 | 150 | - | 1 |
When a new NN now begins recovery, assume it talks only to JN1 and JN2. If it
did not consider {{acceptedInEpoch}}, it would incorrectly decide to finalize
to txid 153, which would break the invariant that finalized log segments
beginning at the same transaction ID must have the same length. Because of
Rule 3.b, it will instead choose JN1 again as the recovery master, and
properly finalize JN1 and JN2 to txid 150 instead of 153, which matches the
segment already finalized on the currently unavailable JN3.
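In code form, the selection rule over the prepareRecovery() responses works
out to roughly the following sketch, which reproduces the example above
(class and field names are illustrative):
{code}
import java.util.Comparator;
import java.util.List;

/** Sketch (not the patch) of the recovery-source selection rule. */
class RecoveryChooser {
  static class PrepareResponse {
    final String jn;
    final long lastWriterEpoch, acceptedInEpoch, lastTxId;
    PrepareResponse(String jn, long lwe, long aie, long txid) {
      this.jn = jn;
      this.lastWriterEpoch = lwe;
      this.acceptedInEpoch = aie;
      this.lastTxId = txid;
    }
    long maxSeenEpoch() { return Math.max(lastWriterEpoch, acceptedInEpoch); }
  }

  // Prefer the highest maxSeenEpoch (Rule 3.b), then the longest log.
  static final Comparator<PrepareResponse> BEST =
      Comparator.comparingLong(PrepareResponse::maxSeenEpoch)
                .thenComparingLong(r -> r.lastTxId);

  public static void main(String[] args) {
    // The second recovery from the table above: only JN1 and JN2 answer.
    // JN1's accepted epoch-2 proposal beats JN2's longer, unrecovered log.
    List<PrepareResponse> responses = List.of(
        new PrepareResponse("JN1", 1, 2, 150),
        new PrepareResponse("JN2", 1, 0, 153));
    PrepareResponse winner = responses.stream().max(BEST).get();
    // Prints: JN1 wins; finalize to txid 150
    System.out.println(winner.jn + " wins; finalize to txid "
        + winner.lastTxId);
  }
}
{code}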
bq. Section 3 - since a reader can read from any JN, if the JN it is reading
from gets disconnected from active, does the reader know about it? How does
this work especially in the context of standby namenode?
Though the SBN can read each segment from any one of the JNs, it actually sends
"getEditLogManifest" to _all_ of the JNs. Then it takes the results and merges
them using RedundantEditLogInputStream. So, if two JNs that have a certain
segment are up, both are available for reading. If one of them crashes in the
middle of the read, the SBN can "fail over" to reading from the other JN that
has the same segment.
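RedundantEditLogInputStream is the real class; its failover behavior is
roughly the following sketch (the real stream also re-synchronizes to the next
needed txid after a failover):
{code}
import java.io.IOException;
import java.util.List;

/**
 * Simplified model of reading one segment that several JNs advertise:
 * try the current stream, and on failure fail over to the next one.
 */
class RedundantStreamSketch {
  interface EditStream {
    long readOp() throws IOException; // next txid, or -1 at end of segment
  }

  long readNextOp(List<EditStream> sameSegmentStreams) throws IOException {
    IOException lastError = null;
    for (EditStream stream : sameSegmentStreams) {
      try {
        return stream.readOp();
      } catch (IOException e) {
        // The JN behind this stream crashed mid-read; fail over to the
        // next JN whose manifest listed the same segment.
        lastError = e;
      }
    }
    throw new IOException("no remaining replicas of this segment", lastError);
  }
}
{code}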
bq. The following additional things would be good to cover in the design:
bq. Cover bootstrapping of the JournalNode and how it is formatted
Added a section to the design doc on bootstrapping and formatting.
bq. Section 2.8 "replacing any current copy of the log segment". Need more
details here. Is it possible that we delete a segment and due to correlated
failures, we lose the journal data in the process. So replacing must perhaps
keep the old log segment until the segment recovery completes.
Can you give a specific example of what you mean here? We don't delete the
existing segment except when we are moving a new one on top of it -- and the
new one has already been determined to be a "valid recovery". The download
process via HTTP also uses FileChannel.force() after downloading to be sure
that the new file is fully on disk before it is moved into place.
bq. How is addition, deletion and JN becoming live again from the previous
state of dead/very slow handled?
On each segment roll, the client will again retry writing to all of the JNs,
even those that had been marked "out of sync" during the previous log segment.
If it's just lagging a bit, then the queueing in the IPCLoggerChannel handles
that (it'll start the new segment a bit behind the other nodes, but that's
fine). Is there a specific example I can explain that would make this clearer?
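Here's a rough model of one writer-to-JN channel to illustrate (names like
outOfSync are illustrative):
{code}
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Rough model of one writer-to-JN channel: edits queue per JN, a JN
 * marked out-of-sync is skipped for the rest of the segment, and each
 * startLogSegment() clears the flag so the JN gets another chance.
 */
class LoggerChannelSketch {
  private final Queue<byte[]> pendingEdits = new ArrayDeque<>();
  private boolean outOfSync;

  synchronized void sendEdits(byte[] batch) {
    if (outOfSync) {
      return; // skip this JN until the next segment roll
    }
    pendingEdits.add(batch); // drained to the JN by a background sender
  }

  synchronized void markOutOfSync() {
    outOfSync = true;       // e.g. the JN timed out or fell too far behind
    pendingEdits.clear();
  }

  synchronized void startLogSegment(long firstTxId) {
    // Retried on every roll, even for previously failed JNs. A JN that
    // was merely lagging starts the new segment a bit behind the others.
    outOfSync = false;
  }
}
{code}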
bq. I am still concerned (see my previous comments about epochs using JNs) that
a NN that does not hold the ZK lock can still cause service interruption. This
could be considered later as an enhancement. This probably is a bigger
discussion.
Yeah, I agree this is worth a separate discussion. There's no real way to tie a
ZK lock to anything except ZK data - you can always think you have the lock,
but by the time you take action, no longer hold it.
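For the record, the race is the classic check-then-act problem (hypothetical
helpers, nothing HDFS-specific):
{code}
/** Classic check-then-act race with an external lock; hypothetical stubs. */
class ZkLockRace {
  void unsafeWrite() {
    if (zkLockIsMine()) {
      // The ZK session can expire right here, letting another NN take
      // the lock before the next line runs. The epoch numbers on the
      // JNs are what close exactly this window.
      writeToSharedStorage();
    }
  }

  private boolean zkLockIsMine() { return true; }  // hypothetical stub
  private void writeToSharedStorage() { }          // hypothetical stub
}
{code}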
bq. I saw couple of white space/empty line changes
Will take care of these, sorry.
bq. Also moving some of the documentation around can be done in trunk, or that
particular change can be merged to trunk to keep this patch smaller.
It seems wrong to merge the docs change to trunk when the code it's documenting
isn't there yet. Aaron posted some helpful diffs with the docs on HDFS-3926 if
you want to review them without all the extra diff noise caused by the move.
bq. An additional comment - in the 3092 design, recovery had just fence
(newEpoch() here) and roll. I am not sure why recovery needs to have so many
steps - prepare, accept and roll. Can you please describe what I am missing?
I think some of the above comments may explain this - in particular the reason
why you need the idea of accepting recovery prior to committing it. Otherwise,
I'll turn the question on its head: why do you think you can get away with so
few steps? Perhaps it's possible in a system that requires every write to go
to all of the nodes rather than just a quorum of them.
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Fix For: QuorumJournalManager (HDFS-3077)
>
> Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt,
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf,
> qjournal-design.pdf, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.