[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470593#comment-13470593
]
Todd Lipcon commented on HDFS-3077:
-----------------------------------
bq. You raised the objection that this breaks the Journal abstraction. Think of
this as an "info-field" of the special no-op transaction where the journal impl
specific information is stored;
This would be problematic for several reasons:
1) "rollEdits" is not a JournalManager operation. The JournalManager treats
edits as opaque things written by the higher level FSEditLog code. Thus it
cannot inject/modify the operations.
2) If the JournalManager is meant to modify the transaction content, this
implies that two different JournalManagers would produce different values for
the same transaction. Thus, the locally-stored edit log segment would differ in
contents from a remotely stored edit log segment. This makes me really nervous:
we should see multiple copies of a log as identical replicas of the same
information, not adulterated with any storage-specific info.
3) In order to address the above issues, we'd have to add QJM-specific code
into the NameNode, and introduce the concept of epochs into the generic
interfaces. This "bleed" of QJM concepts into the main source code is something
we are explicitly trying to avoid by introducing the JournalManager API.
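To make points 1 and 2 concrete, here is a minimal sketch in Python rather than the actual Hadoop code. The class names, the byte encoding of the edits, and the "epoch=" suffix are all hypothetical and chosen only for illustration. It shows that journals which treat edits as opaque bytes end up with identical segments, while a journal that injects its own info into the segment-start transaction does not:

```python
import hashlib


class OpaqueJournal:
    """Hypothetical journal that stores edits exactly as handed to it."""

    def __init__(self):
        self.segment = bytearray()

    def write(self, edit: bytes) -> None:
        # The edit is an opaque blob: append it without looking inside.
        self.segment += edit + b"\n"

    def digest(self) -> str:
        return hashlib.sha256(bytes(self.segment)).hexdigest()


class EpochStampingJournal(OpaqueJournal):
    """Hypothetical journal that rewrites the segment-start record."""

    def __init__(self, epoch: int):
        super().__init__()
        self.epoch = epoch

    def write(self, edit: bytes) -> None:
        # Injecting storage-specific data changes the transaction content.
        if edit == b"START_LOG_SEGMENT":
            edit += b" epoch=%d" % self.epoch
        super().write(edit)


# Made-up edits, as the higher-level log code might hand them down.
edits = [b"START_LOG_SEGMENT", b"OP_MKDIR /a", b"OP_ADD /a/f"]

local, remote = OpaqueJournal(), OpaqueJournal()
stamped = EpochStampingJournal(epoch=3)
for edit in edits:
    for journal in (local, remote, stamped):
        journal.write(edit)

replicas_identical = local.digest() == remote.digest()
stamped_differs = stamped.digest() != local.digest()
print(replicas_identical, stamped_differs)
```

With opaque journals, a simple checksum comparison is enough to confirm that two copies of a segment are replicas of each other. Once one implementation stamps its own data into a transaction, that check no longer holds.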
I am also thinking back to our discussion last summer during the HDFS-1073 work
(particularly HDFS-2018 and HDFS-1580), where you had argued that segments
themselves should be considered an implementation detail of the JournalManager.
So, adding information which is required for correctness into the
START_LOG_SEGMENT written by the NameNode layer takes us farther away from that
goal instead of closer to it.
bq. Suresh and I have been looking at the design and compared it to Paxos and
Zab in detail and have concluded that the design is closer to ZAB than Paxos...
Sure, it's very close to ZAB as well, which I mentioned above in the
discussion. I honestly see ZAB and Paxos as basically the same thing -- ZAB
(and QJM) use something very close to Paxos when they switch epochs. The main
difference between QJM and ZAB is that ZAB actually maintains full histories at
each of the nodes, because it needs to implement a state machine (the database
state). In contrast, QJM allows a journal node to get kicked out for one
segment, then join again in the next segment even if it's missing some txns in
between. This is OK because it is not trying to maintain state, just act as
storage, and IMO it makes things simpler. This difference is enough that I
don't think we should explicitly say that this is an implementation of ZAB.
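The difference can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the QJM implementation: the JournalNode class, the write_segment helper, and the fail_at parameter are hypothetical. A node that fails mid-segment is dropped for the rest of that segment, rejoins at the next segment boundary, and is left with a gap, while every transaction still lands on a majority:

```python
class JournalNode:
    """Hypothetical storage-only node: it records txids and keeps no state machine."""

    def __init__(self, name):
        self.name = name
        self.segments = {}  # first txid of a segment -> txids stored


def write_segment(nodes, first_txid, count, fail_at=None):
    """Write txids [first_txid, first_txid + count) to a majority of nodes.

    fail_at maps a node name to the txid at which that node fails.
    """
    fail_at = fail_at or {}
    active = list(nodes)  # every node may join at a segment boundary
    for node in active:
        node.segments[first_txid] = []
    for txid in range(first_txid, first_txid + count):
        # A failed node is kicked out for the rest of this segment.
        active = [n for n in active if fail_at.get(n.name) != txid]
        if len(active) <= len(nodes) // 2:
            raise RuntimeError("lost write quorum at txid %d" % txid)
        for node in active:
            node.segments[first_txid].append(txid)


nodes = [JournalNode("jn1"), JournalNode("jn2"), JournalNode("jn3")]
write_segment(nodes, first_txid=1, count=5, fail_at={"jn3": 3})
write_segment(nodes, first_txid=6, count=5)  # jn3 rejoins here

jn3 = nodes[2]
copies = {
    txid: sum(txid in txids for n in nodes for txids in n.segments.values())
    for txid in range(1, 11)
}
print(jn3.segments, min(copies.values()))
```

In this model jn3 never backfills txids 3 through 5, which would not be acceptable for a node that has to replay a complete history into a state machine, but is harmless for a node that only stores segments.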
To be perfectly frank, I'm not interested in changing the design substantially
at this point without a good reason. I've put several weeks into testing this
design, and unless you can find a counter-example or a bug, I am against
changing it. If you want to do the work and produce a patch which makes the
code simpler, and it can pass 20,000 runs of the randomized fault test, I'd be
happy to review your patch. Or, if you can point out a flaw in the current
design that your proposed change addresses, I'll do the work to fix
it. But as is, I am confident that the design is correct and don't have more
time to allocate to shifting things around unless there's a bug or another real
problem which would negatively affect its usage.
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Fix For: QuorumJournalManager (HDFS-3077)
>
> Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt,
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf,
> qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf,
> qjournal-design.pdf, qjournal-design.tex, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.