[ 
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470593#comment-13470593
 ] 

Todd Lipcon commented on HDFS-3077:
-----------------------------------

bq. You raised the objection that this breaks the Journal abstraction. Think of 
this as an "info-field" of the special no-op transaction where the journal impl 
specific information is stored; 

This would be problematic for several reasons:
1) "rollEdits" is not a JournalManager operation. The JournalManager treats 
edits as opaque things written by the higher level FSEditLog code. Thus it 
cannot inject/modify the operations.
2) If the JournalManager is meant to modify the transaction content, this 
implies that two different JournalManagers would produce different values for 
the same transaction. Thus, the locally-stored edit log segment would differ in 
contents from a remotely stored edit log segment. This makes me really nervous: 
we should see multiple copies of a log as identical replicas of the same 
information, not adulterated with any storage-specific info.
3) In order to address the above issues, we'd have to add QJM-specific code 
into the NameNode, and introduce the concept of epochs into the generic 
interfaces. This "bleed" of QJM concepts into the main source code is something 
we are explicitly trying to avoid by introducing the JournalManager API.
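
As a rough illustration of point 2, here is a minimal sketch in which the journal layer treats edits as opaque bytes. Everything in it is an assumption made for illustration: `OpaqueJournal`, `InMemoryJournal`, and the payload strings are hypothetical and are not the real JournalManager API or the real edit log serialization. It only shows that when the higher layer serializes an edit once and no backend alters it, the copies stay byte-for-byte identical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical, simplified stand-in for a journal backend.
// This is NOT the real Hadoop JournalManager interface.
interface OpaqueJournal {
    void write(byte[] serializedEdits);
    byte[] contents();
}

// A backend that appends bytes exactly as they were handed to it.
final class InMemoryJournal implements OpaqueJournal {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    @Override
    public void write(byte[] serializedEdits) {
        out.write(serializedEdits, 0, serializedEdits.length);
    }

    @Override
    public byte[] contents() {
        return out.toByteArray();
    }
}

final class OpaqueJournalDemo {
    // The higher layer serializes each edit once and hands the same bytes
    // to every backend, so no backend-specific data enters the log.
    static boolean replicasIdentical() {
        OpaqueJournal local = new InMemoryJournal();
        OpaqueJournal remote = new InMemoryJournal();
        // Placeholder payloads; the real on-disk format is different.
        byte[][] edits = {
            "START_LOG_SEGMENT txid=1".getBytes(StandardCharsets.UTF_8),
            "OP_MKDIR txid=2".getBytes(StandardCharsets.UTF_8),
        };
        for (byte[] edit : edits) {
            local.write(edit);
            remote.write(edit);
        }
        return Arrays.equals(local.contents(), remote.contents());
    }

    public static void main(String[] args) {
        System.out.println("replicas identical: " + replicasIdentical());
    }
}
```

If a backend were allowed to stamp its own information (an epoch, for example) into the payload inside `write`, the equality check at the end would no longer hold across different backends.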

I am also thinking back to our discussion last summer during the HDFS-1073 work 
(particularly HDFS-2018 and HDFS-1580), where you had argued that segments 
themselves should be considered an implementation detail of the JournalManager. 
So, adding information which is required for correctness into the 
START_LOG_SEGMENT written by the NameNode layer takes us farther away from that 
goal instead of closer to it.

bq. Suresh and I have been looking at the design and compared it to Paxos and 
Zab in detail and have concluded that the design is closer to ZAB than Paxos...

Sure, it's very close to ZAB as well, which I mentioned above in the 
discussion. I honestly see ZAB and Paxos as basically the same thing -- ZAB 
(and QJM) use something very close to Paxos when they switch epochs. The main 
difference between QJM and ZAB is that ZAB actually maintains full histories at 
each of the nodes, because it needs to implement a state machine (the database 
state). In contrast, QJM allows a journal node to get kicked out for one 
segment, then join again in the next segment even if it's missing some txns in 
between. This is OK because it is not trying to maintain state, just act as 
storage, and IMO it makes things simpler. This difference is enough that I 
don't think we should explicitly say that this is an implementation of ZAB.
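
To make the difference concrete, here is a minimal sketch of per-segment quorum bookkeeping. It assumes a simple-majority quorum, and the class `SegmentQuorumSketch`, its methods, and the node names are hypothetical; it is a toy model, not the QJM code. It shows a node that misses one segment and rejoins for the next, while every segment is still held by a majority.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy model: tracks which journal nodes hold which segment.
final class SegmentQuorumSketch {
    private final int totalNodes;
    // First txid of a segment -> ids of the nodes that stored that segment.
    private final Map<Long, Set<String>> holdersBySegment = new TreeMap<>();

    SegmentQuorumSketch(int totalNodes) {
        this.totalNodes = totalNodes;
    }

    void recordSegment(long firstTxId, String... nodeIds) {
        holdersBySegment.put(firstTxId, new HashSet<>(Arrays.asList(nodeIds)));
    }

    boolean nodeHolds(String nodeId, long firstTxId) {
        Set<String> holders = holdersBySegment.get(firstTxId);
        return holders != null && holders.contains(nodeId);
    }

    // Assumption: a segment is safe once a simple majority of nodes has it.
    boolean hasQuorum(long firstTxId) {
        Set<String> holders = holdersBySegment.get(firstTxId);
        return holders != null && holders.size() >= totalNodes / 2 + 1;
    }

    boolean everySegmentHasQuorum() {
        for (long firstTxId : holdersBySegment.keySet()) {
            if (!hasQuorum(firstTxId)) {
                return false;
            }
        }
        return true;
    }

    // Three nodes; jn3 is dropped for the middle segment and rejoins later.
    static SegmentQuorumSketch example() {
        SegmentQuorumSketch log = new SegmentQuorumSketch(3);
        log.recordSegment(1L, "jn1", "jn2", "jn3");
        log.recordSegment(101L, "jn1", "jn2");
        log.recordSegment(201L, "jn1", "jn2", "jn3");
        return log;
    }

    public static void main(String[] args) {
        SegmentQuorumSketch log = example();
        System.out.println("jn3 holds segment 101: " + log.nodeHolds("jn3", 101L));
        System.out.println("every segment has quorum: " + log.everySegmentHasQuorum());
    }
}
```

In this model jn3 has a gap in its history, which a replicated state machine could not tolerate, but the log as a whole remains complete because each segment is available from a majority.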

To be perfectly frank, I'm not interested in changing the design substantially 
at this point without a good reason. I've put several weeks into testing this 
design, and unless you can find a counter-example or a bug, I am against 
changing it. If you want to do the work and produce a patch which makes the 
code simpler, and it can pass 20,000 runs of the randomized fault test, I'd be 
happy to review your patch. Or if you can point out a flaw in the current 
design that's addressed by your proposed change, I'll do the work to address 
it. But as is, I am confident that the design is correct and don't have more 
time to allocate to shifting things around unless there's a bug or another real 
problem which would negatively affect its usage.
                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, 
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, 
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.pdf, qjournal-design.tex, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on 
> shared storage such as an NFS filer for the shared edit log. One alternative 
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
> which provides a highly available replicated edit log on commodity hardware. 
> This JIRA is to implement another alternative, based on a quorum commit 
> protocol, integrated more tightly in HDFS and with the requirements driven 
> only by HDFS's needs rather than more generic use cases. More details to 
> follow.

