[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464011#comment-13464011
]
Suresh Srinivas commented on HDFS-3077:
---------------------------------------
Finally read through the design :-).
Design document comments:
# "Henceforth we will refer to these nodes as replicas." Please use a different
term as replicas is heavily used in the context of block replica in HDFS.
Perhaps Journal Replicas may be a better name.
# "Before taking action in response to any RPC, the JournalNode checks the
requester's epoch number
against its lastPromisedEpoch variable. If the requester's epoch is lower, then
it will reject the request". This is true only for the RPCs other than
newEpoch. Further, it should say that the request is rejected if the
requester's epoch is not equal to lastPromisedEpoch.
# In Generating epoch numbers section
#* In step 3, you mean newEpoch is sent to "JNs" and not QJMs. The rest of the
description should also read "JNs" instead of "QJMs".
#* In step 4. "Otherwise, it aborts the attempt to become the active writer."
What is the state of QJM after this at the namenode? More details needed.
# Section 2.6, bullet 3 - is synchronization on quorum nodes done for only the
last segments or all the segments (required for a given fsimage?). Based on the
answer, section 2.8 might require updates.
# Say a new JN is added, or an older JN comes back up, during a restart of the
cluster. I think you may achieve quorum without the overlap of a node that was
part of the previous quorum write. This could result in loading a stale
journal. How do we handle this? Is the set of JNs that the system was
configured/working with recorded somewhere?
# What is the effect of newEpoch from another writer on a JournalNode that is
performing recovery, especially when it is performing AcceptRecovery? It would
be good to cover what happens in other states as well.
# In "Prepare Recovery RPC", how does the writer use a previously accepted
recovery proposal?
# Does accept recovery wait till journal segments are downloaded? How does the
timeout work for this?
# Section 2.9 - "For each logger, calculate maxSeenEpoch as the greater of that
logger's lastWriterEpoch and the epoch number corresponding to any previously
accepted recovery proposal." Can you explain in section 2.10.6 why previously
accepted recovery proposal needs to be considered?
# Section 3 - since a reader can read from any JN, if the JN it is reading from
gets disconnected from active, does the reader know about it? How does this
work especially in the context of standby namenode?
# The following additional things would be good to cover in the design:
#* Cover bootstrapping of a JournalNode and how it is formatted
#* Section 2.8 "replacing any current copy of the log segment". Need more
details here. Is it possible that we delete a segment and, due to correlated
failures, lose the journal data in the process? So replacing should perhaps
keep the old log segment until the segment recovery completes.
#* How are addition and deletion of a JN, and a JN becoming live again after
previously being dead or very slow, handled?
# I am still concerned (see my previous comments about epochs using JNs) that a
NN that does not hold the ZK lock can still cause service interruption. This
could be considered later as an enhancement. This is probably a bigger
discussion.
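To make the lastPromisedEpoch comment above concrete, here is a minimal sketch
of the check as I read the document (class and method names are illustrative,
not the actual JournalNode code): newEpoch must be allowed to *raise*
lastPromisedEpoch, while every other RPC must match it exactly.

```java
// Hedged sketch of the per-request epoch check described in the design.
// All names here are my own, not the real HDFS classes.
public class EpochCheck {
    private long lastPromisedEpoch = 0;

    // All RPCs other than newEpoch: reject unless the caller's epoch
    // exactly equals the last promised epoch.
    public boolean checkRequest(long requestEpoch) {
        return requestEpoch == lastPromisedEpoch;
    }

    // newEpoch: accept only a strictly higher epoch, and promise it.
    public boolean newEpoch(long proposedEpoch) {
        if (proposedEpoch <= lastPromisedEpoch) {
            return false; // a newer writer has already been promised
        }
        lastPromisedEpoch = proposedEpoch;
        return true;
    }
}
```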
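Similarly, the epoch-generation steps could be summarized roughly as follows
(a sketch under my reading of the document; proposeEpoch and quorumAccepted
are hypothetical helpers, not the actual QJM API): the writer takes the
maximum promised epoch seen across the JNs, proposes max+1, and becomes the
active writer only if a strict majority accepts.

```java
import java.util.List;

// Hedged sketch of the "generating epoch numbers" steps, not real QJM code.
public class EpochGeneration {
    // Step: gather lastPromisedEpoch from the JNs and propose one higher.
    public static long proposeEpoch(List<Long> lastPromisedEpochs) {
        long max = 0;
        for (long e : lastPromisedEpochs) {
            max = Math.max(max, e);
        }
        return max + 1;
    }

    // Step: the writer proceeds only if a strict majority of JNs accepted.
    public static boolean quorumAccepted(int acks, int totalJns) {
        return acks > totalJns / 2;
    }
}
```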
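To illustrate the quorum-overlap concern above: two majority quorums of the
*same* JN set always intersect, but if the JN set changes across a restart, a
new majority need not intersect the previous write quorum, so a stale journal
could win. A toy check (helper names are mine):

```java
import java.util.Set;

// Toy illustration of majority-quorum intersection, not HDFS code.
public class QuorumOverlap {
    // A subset is a quorum iff it is a strict majority of the JN set.
    public static boolean isQuorum(Set<String> subset, Set<String> jns) {
        return jns.containsAll(subset) && subset.size() > jns.size() / 2;
    }

    // Do two quorums share at least one JN?
    public static boolean overlaps(Set<String> a, Set<String> b) {
        for (String x : a) {
            if (b.contains(x)) return true;
        }
        return false;
    }
}
```

With the original set {jn1, jn2, jn3}, any two majorities overlap; but if the
set becomes {jn3, jn4, jn5} after a restart, the majority {jn4, jn5} shares no
node with the old write quorum {jn1, jn2}.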
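For the Section 2.9 comment, the maxSeenEpoch calculation I am asking about
amounts to something like this (the field names are assumptions on my part,
not the actual PrepareRecovery response):

```java
// Hedged sketch of the quoted per-logger maxSeenEpoch computation.
public class MaxSeenEpoch {
    // acceptedRecoveryEpoch is null when the logger has no previously
    // accepted recovery proposal.
    public static long maxSeenEpoch(long lastWriterEpoch,
                                    Long acceptedRecoveryEpoch) {
        if (acceptedRecoveryEpoch == null) {
            return lastWriterEpoch;
        }
        return Math.max(lastWriterEpoch, acceptedRecoveryEpoch);
    }
}
```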
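On the "replacing any current copy of the log segment" bullet, one possible
scheme that never drops the last copy is to rename the current segment aside
until the downloaded replacement is durably in place (paths, naming, and the
helper itself are purely hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical safe-replacement sketch: a crash mid-replacement never
// leaves us with zero copies of the segment.
public class SegmentReplace {
    public static void replaceSegment(Path current, Path downloaded)
            throws IOException {
        Path backup = current.resolveSibling(current.getFileName() + ".stale");
        if (Files.exists(current)) {
            // Keep the old copy under a .stale name, do not delete it yet.
            Files.move(current, backup, StandardCopyOption.REPLACE_EXISTING);
        }
        // Atomically put the downloaded replacement in place.
        Files.move(downloaded, current, StandardCopyOption.ATOMIC_MOVE);
        // Only now is it safe to drop the old copy.
        Files.deleteIfExists(backup);
    }
}
```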
As regards to code changes:
# I saw a couple of whitespace/empty-line changes
# Also, moving some of the documentation around can be done in trunk, or that
particular change can be merged to trunk to keep this patch smaller.
I will continue with the review of code.
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Fix For: QuorumJournalManager (HDFS-3077)
>
> Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt,
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf,
> qjournal-design.pdf, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.