[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403494#comment-13403494
]
Todd Lipcon commented on HDFS-3077:
-----------------------------------
The following are the diffs introduced by the HDFS-3092 branch as of 1346682
{code}
todd@todd-w510:~/git/hadoop-common/hadoop-hdfs-project$ git diff --stat
origin/trunk..HDFS-3092
.../hadoop-hdfs/CHANGES.HDFS-3092.txt | 42 +++
hadoop-hdfs-project/hadoop-hdfs/pom.xml | 23 ++
.../java/org/apache/hadoop/hdfs/DFSConfigKeys.java | 12 +-
.../src/main/webapps/journal/index.html | 29 ++
.../src/main/webapps/journal/journalstatus.jsp | 42 +++
.../src/main/webapps/proto-journal-web.xml | 17 +
.../main/java/org/apache/hadoop/hdfs/DFSUtil.java | 58 ++++
.../java/org/apache/hadoop/hdfs/TestDFSUtil.java | 41 +++
.../hdfs/protocol/UnregisteredNodeException.java | 4 +
.../hdfs/protocolPB/JournalSyncProtocolPB.java | 41 +++
.../JournalSyncProtocolServerSideTranslatorPB.java | 60 ++++
.../JournalSyncProtocolTranslatorPB.java | 79 +++++
.../server/protocol/JournalServiceProtocols.java | 27 ++
.../hdfs/server/protocol/JournalSyncProtocol.java | 58 ++++
.../src/main/proto/JournalSyncProtocol.proto | 57 +++
.../journalservice/GetJournalEditServlet.java | 177 ++++++++++
.../hadoop/hdfs/server/journalservice/Journal.java | 130 +++++++
.../server/journalservice/JournalDiskWriter.java | 61 ++++
.../server/journalservice/JournalHttpServer.java | 172 ++++++++++
.../server/journalservice/JournalListener.java | 4 +-
.../hdfs/server/journalservice/JournalService.java | 359 +++++++++++++++++---
.../hadoop/hdfs/server/namenode/FSEditLog.java | 65 ++++-
.../hadoop/hdfs/server/namenode/FSImage.java | 4 +-
.../hdfs/server/namenode/GetImageServlet.java | 10 +-
.../hadoop/hdfs/server/namenode/NNStorage.java | 4 +-
.../hdfs/server/namenode/TransferFsImage.java | 4 +-
.../hadoop/hdfs/server/namenode/JournalSet.java | 34 ++
.../org/apache/hadoop/hdfs/MiniDFSCluster.java | 9 +
.../hdfs/server/journalservice/TestJournal.java | 71 ++++
.../journalservice/TestJournalHttpServer.java | 311 +++++++++++++++++
.../server/journalservice/TestJournalService.java | 150 +++++++--
31 files changed, 2071 insertions(+), 84 deletions(-)
{code}
For each of these, I'll explain how the code was incorporated in the 3077 work.
Or, in the case that the code was not carried over, explain why it doesn't make
sense.
{code}
hadoop-hdfs-project/hadoop-hdfs/pom.xml | 23 ++
.../java/org/apache/hadoop/hdfs/DFSConfigKeys.java | 12 +-
.../src/main/webapps/journal/index.html | 29 ++
.../src/main/webapps/journal/journalstatus.jsp | 42 +++
.../src/main/webapps/proto-journal-web.xml | 17 +
{code}
These are carried over, with the exception of the HTTPS-related keys. Trunk has
moved away from Kerberized HTTPS and towards SPNEGO for authenticated HTTP.
{code}
.../main/java/org/apache/hadoop/hdfs/DFSUtil.java | 58 ++++
.../java/org/apache/hadoop/hdfs/TestDFSUtil.java | 41 +++
{code}
These diffs provided a way to parse out the list of journal nodes from the
Configuration object. But, it makes more sense to provide this list in the
actual URI of the edits logs, for consistency with the existing BKJM
implementation. So, the equivalent method now exists as
{{QuorumJournalManager.getLoggerAddresses(URI)}}. Moving it from DFSUtil to QJM
also made sense so that all of the new code was self-contained in its own
package (per the spirit of Journal Managers being pluggable components, I don't
think we should refer to them from the main DFS code)
{code}
.../hdfs/protocol/UnregisteredNodeException.java | 4 +
{code}
Given that the NN is configured with a static list of Journal Managers to write
to, and that membership is static, there's no need for a registration concept
with the JNs -- the NN sends RPCs _to_ the JNs, rather than the other way
around.
{code}
.../hdfs/protocolPB/JournalSyncProtocolPB.java | 41 +++
.../JournalSyncProtocolServerSideTranslatorPB.java | 60 ++++
.../JournalSyncProtocolTranslatorPB.java | 79 +++++
.../server/protocol/JournalServiceProtocols.java | 27 ++
.../hdfs/server/protocol/JournalSyncProtocol.java | 58 ++++
.../src/main/proto/JournalSyncProtocol.proto | 57 +++
{code}
I merged this protocol in with the other RPC protocol, since it reused the same
types anyway. If there's a strong motivation to have this as a separate
protocol, I could be convinced, but I think we need to make use of JN-specific
items like epoch ID in here, which wouldn't make sense in the context of a NN
or other edits storage.
{code}
.../journalservice/GetJournalEditServlet.java | 177 ++++++++++
{code}
This has been moved into the qjournal package. It is otherwise mostly the same.
The one salient difference is that it now only accepts the {{startTxId}} of the
segment to be downloaded. This was necessary because, during the recovery step,
the source node for synchronization may finalize its log segment while other
nodes are in the process of synchronizing from it. So, we look for either
in-progress or finalized logs. I can explain this in further detail if
necessary.
{code}
.../hadoop/hdfs/server/journalservice/Journal.java | 130 +++++++
.../server/journalservice/JournalDiskWriter.java | 61 ++++
{code}
These have been merged and moved to the qjournal package, and simplified to
work directly with a single FileJournalManager instead of an FSEditLog and
NNStorage. The reasoning is that a journal's storage is not the same as a
NameNode's storage, nor is it a fully general-purpose wrapper around edit logs.
For example, we are not trying to support running a JournalNode which itself
logs to a pluggable backend log. Additionally, at this point, we can only
correctly support a single directory under a quorum participant, or else there
are a lot more edge cases to consider where a node may renege on its promises
if its set of directories changes during a restart.
{code}
.../server/journalservice/JournalHttpServer.java | 172 ++++++++++
{code}
This has been moved to the qjournal package, and otherwise mostly the same. The
one difference is that I switched to SPNEGO instead of the kerberized SSL, to
match trunk.
{code}
.../server/journalservice/JournalListener.java | 4 +-
{code}
The change here in the 3092 branch was just adding a new exception to a method,
which didn't turn out to be necessary anymore.
{code}
.../hdfs/server/journalservice/JournalService.java | 359 +++++++++++++++++---
{code}
This class is called JournalNode now. We no longer have a state machine here -
the recovery process/syncing process is coordinated by the NameNode/client
side, and the commit protocol ensures that every segment is always finalized on
a majority of nodes. So the state machine isn't necessary to ensure that all
the log segments are replicated successfully.
{code}
.../hadoop/hdfs/server/namenode/FSEditLog.java | 65 ++++-
.../hadoop/hdfs/server/namenode/FSImage.java | 4 +-
{code}
The new functions added here were necessary for the synchronization process
above, but not necessary for the recovery process implemented by 3077.
{code}
.../hdfs/server/namenode/GetImageServlet.java | 10 +-
.../hadoop/hdfs/server/namenode/NNStorage.java | 4 +-
.../hdfs/server/namenode/TransferFsImage.java | 4 +-
{code}
Just changes to make things public -- same changes in 3077 patch.
{code}
.../hadoop/hdfs/server/namenode/JournalSet.java | 34 ++
{code}
Since the Journal talks directly to FJM in 3077, the {{getFinalizedSegments}}
function added here is instead fulfilled by making the existing function
{{FileJournalManager.getLogFiles()}} public.
{code}
.../org/apache/hadoop/hdfs/MiniDFSCluster.java | 9 +
{code}
The 3077 branch has a MiniJournalCluster which can start/stop/restart
JournalNodes. The change here in the 3092 branch appears to be incomplete -- it
adds functions which are never called.
{code}
.../hdfs/server/journalservice/TestJournal.java | 71 ++++
.../journalservice/TestJournalHttpServer.java | 311 +++++++++++++++++
.../server/journalservice/TestJournalService.java | 150 +++++++--
{code}
Several of these tests are carried over into similarly named tests in 3077.
Others didn't make sense due to changes mentioned above. The test coverage in
the patch attached here is comparable to the coverage in 3092, and I'm working
on adding a lot more tests currently.
So, to summarize the key similarities and differences:
- most of the HTTP server, RPC server, node wrapper code, JSP pages, journal
HTTP servlet carried over
-- SPNEGO support instead of KSSL
-- Lifecycle code a little different in order to work with MiniJournalCluster
- the edits "synchronization" code is mostly removed, since the synchronization
is now NN-led. If we feel strongly that the JNs should synchronize "old" log
segments, instead of just ensuring that every segment is stored on a quorum, we
can bring some of this back. But we're already guaranteed that every segment
has two replicas by this design.
- new JNStorage class, since the JN doesn't actually share most of its storage
code with the NameNode (eg we have no images, no checkpoints, no distributed
upgrade coordination, etc
- Journal updated to use above JNStorage class instead of NNStorage and
FSEditLog
- Not attempting to share the RPC protocol with BackupNode or NameNode, since
we need quorum-specific information like - Not attempting to share the RPC
protocol with BackupNode or NameNode, since we need quorum-specific information
like epoch numbers in every RPC, and those don't make sense in the other
contexts.
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hdfs-3077-partial.txt, hdfs-3077.txt,
> qjournal-design.pdf, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira