[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464011#comment-13464011
]
Suresh Srinivas commented on HDFS-3077:
---------------------------------------
Finally read through the design :-).
Design document comments:
# "Henceforth we will refer to these nodes as replicas." Please use a different
term as replicas is heavily used in the context of block replica in HDFS.
Perhaps Journal Replicas may be a better name.
# "Before taking action in response to any RPC, the JournalNode checks the
requester's epoch number
against its lastPromisedEpoch variable. If the requester's epoch is lower, then
it will reject the request". This is true only for the RPCs other than
newEpoch. Further, it should say that the request is rejected if the
requester's epoch is not equal to lastPromisedEpoch.
# In Generating epoch numbers section
#* In step 3, you mean newEpoch is sent to "JNs" and not QJMs. The rest of the
description should also read "JNs" instead of "QJMs".
#* In step 4. "Otherwise, it aborts the attempt to become the active writer."
What is the state of QJM after this at the namenode? More details needed.
# Section 2.6, bullet 3 - is synchronization on quorum nodes done for only the
last segments or all the segments (required for a given fsimage?). Based on the
answer, section 2.8 might require updates.
# Say a new JN is added, or an older JN comes back up, during a restart of the
cluster. I think you may achieve quorum without the overlap of a node that was
part of the previous quorum write. This could result in loading a stale
journal. How do we handle this? Is the set of JNs that the system was
configured/working with recorded somewhere?
# What is the effect of newEpoch from another writer on a JournalNode that is
performing recovery, especially when it is performing AcceptRecovery? It would
be good to cover what happens in other states as well.
# In "Prepare Recovery RPC", how does the writer use a previously accepted
recovery proposal?
# Does accept recovery wait till journal segments are downloaded? How does the
timeout work for this?
# Section 2.9 - "For each logger, calculate maxSeenEpoch as the greater of that
logger's lastWriterEpoch and the epoch number corresponding to any previously
accepted recovery proposal." Can you explain in section 2.10.6 why previously
accepted recovery proposal needs to be considered?
# Section 3 - since a reader can read from any JN, if the JN it is reading from
gets disconnected from active, does the reader know about it? How does this
work especially in the context of standby namenode?
# The following additional things would be good to cover in the design:
#* Cover bootstrapping of a JournalNode and how it is formatted
#* Section 2.8 "replacing any current copy of the log segment". Need more
details here. Is it possible that we delete a segment and, due to correlated
failures, lose the journal data in the process? So replacing should perhaps
keep the old log segment until the segment recovery completes.
#* How are addition and deletion of a JN, and a JN becoming live again after
previously being dead or very slow, handled?
# I am still concerned (see my previous comments about epochs using JNs) that a
NN that does not hold the ZK lock can still cause service interruption. This
could be considered later as an enhancement. This is probably a bigger
discussion.
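To make the lastPromisedEpoch comment above concrete, here is a minimal sketch
of the check as I read the document (class and method names are illustrative,
not the actual JournalNode code): newEpoch must be allowed to *raise*
lastPromisedEpoch, while every other RPC must match it exactly.

```java
// Hedged sketch of the per-request epoch check described in the design.
// All names here are my own, not the real HDFS classes.
public class EpochCheck {
    private long lastPromisedEpoch = 0;

    // All RPCs other than newEpoch: reject unless the caller's epoch
    // exactly equals the last promised epoch.
    public boolean checkRequest(long requestEpoch) {
        return requestEpoch == lastPromisedEpoch;
    }

    // newEpoch: accept only a strictly higher epoch, and promise it.
    public boolean newEpoch(long proposedEpoch) {
        if (proposedEpoch <= lastPromisedEpoch) {
            return false; // a newer writer has already been promised
        }
        lastPromisedEpoch = proposedEpoch;
        return true;
    }
}
```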
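Similarly, the epoch-generation steps could be summarized roughly as follows
(a sketch under my reading of the document; proposeEpoch and quorumAccepted
are hypothetical helpers, not the actual QJM API): the writer takes the
maximum promised epoch seen across the JNs, proposes max+1, and becomes the
active writer only if a strict majority accepts.

```java
import java.util.List;

// Hedged sketch of the "generating epoch numbers" steps, not real QJM code.
public class EpochGeneration {
    // Step: gather lastPromisedEpoch from the JNs and propose one higher.
    public static long proposeEpoch(List<Long> lastPromisedEpochs) {
        long max = 0;
        for (long e : lastPromisedEpochs) {
            max = Math.max(max, e);
        }
        return max + 1;
    }

    // Step: the writer proceeds only if a strict majority of JNs accepted.
    public static boolean quorumAccepted(int acks, int totalJns) {
        return acks > totalJns / 2;
    }
}
```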
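To illustrate the quorum-overlap concern above: two majority quorums of the
*same* JN set always intersect, but if the JN set changes across a restart, a
new majority need not intersect the previous write quorum, so a stale journal
could win. A toy check (helper names are mine):

```java
import java.util.Set;

// Toy illustration of majority-quorum intersection, not HDFS code.
public class QuorumOverlap {
    // A subset is a quorum iff it is a strict majority of the JN set.
    public static boolean isQuorum(Set<String> subset, Set<String> jns) {
        return jns.containsAll(subset) && subset.size() > jns.size() / 2;
    }

    // Do two quorums share at least one JN?
    public static boolean overlaps(Set<String> a, Set<String> b) {
        for (String x : a) {
            if (b.contains(x)) return true;
        }
        return false;
    }
}
```

With the original set {jn1, jn2, jn3}, any two majorities overlap; but if the
set becomes {jn3, jn4, jn5} after a restart, the majority {jn4, jn5} shares no
node with the old write quorum {jn1, jn2}.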
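For the Section 2.9 comment, the maxSeenEpoch calculation I am asking about
amounts to something like this (the field names are assumptions on my part,
not the actual PrepareRecovery response):

```java
// Hedged sketch of the quoted per-logger maxSeenEpoch computation.
public class MaxSeenEpoch {
    // acceptedRecoveryEpoch is null when the logger has no previously
    // accepted recovery proposal.
    public static long maxSeenEpoch(long lastWriterEpoch,
                                    Long acceptedRecoveryEpoch) {
        if (acceptedRecoveryEpoch == null) {
            return lastWriterEpoch;
        }
        return Math.max(lastWriterEpoch, acceptedRecoveryEpoch);
    }
}
```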
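On the "replacing any current copy of the log segment" bullet, one possible
scheme that never drops the last copy is to rename the current segment aside
until the downloaded replacement is durably in place (paths, naming, and the
helper itself are purely hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical safe-replacement sketch: a crash mid-replacement never
// leaves us with zero copies of the segment.
public class SegmentReplace {
    public static void replaceSegment(Path current, Path downloaded)
            throws IOException {
        Path backup = current.resolveSibling(current.getFileName() + ".stale");
        if (Files.exists(current)) {
            // Keep the old copy under a .stale name, do not delete it yet.
            Files.move(current, backup, StandardCopyOption.REPLACE_EXISTING);
        }
        // Atomically put the downloaded replacement in place.
        Files.move(downloaded, current, StandardCopyOption.ATOMIC_MOVE);
        // Only now is it safe to drop the old copy.
        Files.deleteIfExists(backup);
    }
}
```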
As regards to code changes:
# I saw a couple of whitespace/empty-line changes
# Also, moving some of the documentation around can be done in trunk, or that
particular change can be merged to trunk to keep this patch smaller.
I will continue with the review of code.
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Fix For: QuorumJournalManager (HDFS-3077)
>
> Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt,
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf,
> qjournal-design.pdf, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.