[jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs

Todd Lipcon (JIRA) Thu, 28 Jun 2012 14:37:48 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403494#comment-13403494
 ]


Todd Lipcon commented on HDFS-3077:
-----------------------------------

The following are the diffs introduced by the HDFS-3092 branch as of 1346682

{code}
todd@todd-w510:~/git/hadoop-common/hadoop-hdfs-project$ git diff --stat 
origin/trunk..HDFS-3092
 .../hadoop-hdfs/CHANGES.HDFS-3092.txt              |   42 +++
 hadoop-hdfs-project/hadoop-hdfs/pom.xml            |   23 ++
 .../java/org/apache/hadoop/hdfs/DFSConfigKeys.java |   12 +-
 .../src/main/webapps/journal/index.html            |   29 ++
 .../src/main/webapps/journal/journalstatus.jsp     |   42 +++
 .../src/main/webapps/proto-journal-web.xml         |   17 +
 .../main/java/org/apache/hadoop/hdfs/DFSUtil.java  |   58 ++++
 .../java/org/apache/hadoop/hdfs/TestDFSUtil.java   |   41 +++
 .../hdfs/protocol/UnregisteredNodeException.java   |    4 +
 .../hdfs/protocolPB/JournalSyncProtocolPB.java     |   41 +++
 .../JournalSyncProtocolServerSideTranslatorPB.java |   60 ++++
 .../JournalSyncProtocolTranslatorPB.java           |   79 +++++
 .../server/protocol/JournalServiceProtocols.java   |   27 ++
 .../hdfs/server/protocol/JournalSyncProtocol.java  |   58 ++++
 .../src/main/proto/JournalSyncProtocol.proto       |   57 +++
 .../journalservice/GetJournalEditServlet.java      |  177 ++++++++++
 .../hadoop/hdfs/server/journalservice/Journal.java |  130 +++++++
 .../server/journalservice/JournalDiskWriter.java   |   61 ++++
 .../server/journalservice/JournalHttpServer.java   |  172 ++++++++++
 .../server/journalservice/JournalListener.java     |    4 +-
 .../hdfs/server/journalservice/JournalService.java |  359 +++++++++++++++++---
 .../hadoop/hdfs/server/namenode/FSEditLog.java     |   65 ++++-
 .../hadoop/hdfs/server/namenode/FSImage.java       |    4 +-
 .../hdfs/server/namenode/GetImageServlet.java      |   10 +-
 .../hadoop/hdfs/server/namenode/NNStorage.java     |    4 +-
 .../hdfs/server/namenode/TransferFsImage.java      |    4 +-
 .../hadoop/hdfs/server/namenode/JournalSet.java    |   34 ++
 .../org/apache/hadoop/hdfs/MiniDFSCluster.java     |    9 +
 .../hdfs/server/journalservice/TestJournal.java    |   71 ++++
 .../journalservice/TestJournalHttpServer.java      |  311 +++++++++++++++++
 .../server/journalservice/TestJournalService.java  |  150 +++++++--
 31 files changed, 2071 insertions(+), 84 deletions(-)
{code}

For each of these, I'll explain how the code was incorporated in the 3077 work. 
Or, in the case that the code was not carried over, explain why it doesn't make 
sense.


{code}
 hadoop-hdfs-project/hadoop-hdfs/pom.xml            |   23 ++
 .../java/org/apache/hadoop/hdfs/DFSConfigKeys.java |   12 +-
 .../src/main/webapps/journal/index.html            |   29 ++
 .../src/main/webapps/journal/journalstatus.jsp     |   42 +++
 .../src/main/webapps/proto-journal-web.xml         |   17 +
{code}

These are carried over, with the exception of the HTTPS-related keys. Trunk has 
moved away from Kerberized HTTPS and towards SPNEGO for authenticated HTTP.

{code}
 .../main/java/org/apache/hadoop/hdfs/DFSUtil.java  |   58 ++++
 .../java/org/apache/hadoop/hdfs/TestDFSUtil.java   |   41 +++
{code}

These diffs provided a way to parse out the list of journal nodes from the 
Configuration object. But, it makes more sense to provide this list in the 
actual URI of the edits logs, for consistency with the existing BKJM 
implementation. So, the equivalent method now exists as 
{{QuorumJournalManager.getLoggerAddresses(URI)}}. Moving it from DFSUtil to QJM 
also made sense so that all of the new code was self-contained in its own 
package (per the spirit of Journal Managers being pluggable components, I don't 
think we should refer to them from the main DFS code)

{code}
 .../hdfs/protocol/UnregisteredNodeException.java   |    4 +
{code}

Given that the NN is configured with a static list of Journal Managers to write 
to, and that membership is static, there's no need for a registration concept 
with the JNs -- the NN sends RPCs _to_ the JNs, rather than the other way 
around.

{code}
 .../hdfs/protocolPB/JournalSyncProtocolPB.java     |   41 +++
 .../JournalSyncProtocolServerSideTranslatorPB.java |   60 ++++
 .../JournalSyncProtocolTranslatorPB.java           |   79 +++++
 .../server/protocol/JournalServiceProtocols.java   |   27 ++
 .../hdfs/server/protocol/JournalSyncProtocol.java  |   58 ++++
 .../src/main/proto/JournalSyncProtocol.proto       |   57 +++
{code}

I merged this protocol in with the other RPC protocol, since it reused the same 
types anyway. If there's a strong motivation to have this as a separate 
protocol, I could be convinced, but I think we need to make use of JN-specific 
items like epoch ID in here, which wouldn't make sense in the context of a NN 
or other edits storage.

{code}
 .../journalservice/GetJournalEditServlet.java      |  177 ++++++++++
{code}

This has been moved into the qjournal package. It is otherwise mostly the same. 
The one salient difference is that it now only accepts the {{startTxId}} of the 
segment to be downloaded. This was necessary because, during the recovery step, 
the source node for synchronization may finalize its log segment while other 
nodes are in the process of synchronizing from it. So, we look for either 
in-progress or finalized logs. I can explain this in further detail if 
necessary.

{code}
 .../hadoop/hdfs/server/journalservice/Journal.java |  130 +++++++
 .../server/journalservice/JournalDiskWriter.java   |   61 ++++
{code}
These have been merged and moved to the qjournal package, and simplified to 
work directly with a single FileJournalManager instead of an FSEditLog and 
NNStorage. The reasoning is that a journal's storage is not the same as a 
NameNode's storage, nor is it a fully general-purpose wrapper around edit logs. 
For example, we are not trying to support running a JournalNode which itself 
logs to a pluggable backend log. Additionally, at this point, we can only 
correctly support a single directory under a quorum participant, or else there 
are a lot more edge cases to consider where a node may renege on its promises 
if its set of directories changes during a restart.

{code}
 .../server/journalservice/JournalHttpServer.java   |  172 ++++++++++
{code}

This has been moved to the qjournal package, and otherwise mostly the same. The 
one difference is that I switched to SPNEGO instead of the kerberized SSL, to 
match trunk.

{code}
 .../server/journalservice/JournalListener.java     |    4 +-
{code}
The change here in the 3092 branch was just adding a new exception to a method, 
which didn't turn out to be necessary anymore.

{code}
 .../hdfs/server/journalservice/JournalService.java |  359 +++++++++++++++++---
{code}
This class is called JournalNode now. We no longer have a state machine here - 
the recovery process/syncing process is coordinated by the NameNode/client 
side, and the commit protocol ensures that every segment is always finalized on 
a majority of nodes. So the state machine isn't necessary to ensure that all 
the log segments are replicated successfully.

{code}
 .../hadoop/hdfs/server/namenode/FSEditLog.java     |   65 ++++-
 .../hadoop/hdfs/server/namenode/FSImage.java       |    4 +-
{code}
The new functions added here were necessary for the synchronization process 
above, but not necessary for the recovery process implemented by 3077.

{code}
 .../hdfs/server/namenode/GetImageServlet.java      |   10 +-
 .../hadoop/hdfs/server/namenode/NNStorage.java     |    4 +-
 .../hdfs/server/namenode/TransferFsImage.java      |    4 +-
{code}

Just changes to make things public -- same changes in 3077 patch.

{code}
 .../hadoop/hdfs/server/namenode/JournalSet.java    |   34 ++
{code}
Since the Journal talks directly to FJM in 3077, the {{getFinalizedSegments}} 
function added here is instead fulfilled by making the existing function 
{{FileJournalManager.getLogFiles()}} public.

{code}
 .../org/apache/hadoop/hdfs/MiniDFSCluster.java     |    9 +
{code}
The 3077 branch has a MiniJournalCluster which can start/stop/restart 
JournalNodes. The change here in the 3092 branch appears to be incomplete -- it 
adds functions which are never called.

{code}
 .../hdfs/server/journalservice/TestJournal.java    |   71 ++++
 .../journalservice/TestJournalHttpServer.java      |  311 +++++++++++++++++
 .../server/journalservice/TestJournalService.java  |  150 +++++++--
{code}

Several of these tests are carried over into similarly named tests in 3077. 
Others didn't make sense due to changes mentioned above. The test coverage in 
the patch attached here is comparable to the coverage in 3092, and I'm working 
on adding a lot more tests currently.


So, to summarize the key similarities and differences:
- most of the HTTP server, RPC server, node wrapper code, JSP pages, journal 
HTTP servlet carried over
-- SPNEGO support instead of KSSL
-- Lifecycle code a little different in order to work with MiniJournalCluster
- the edits "synchronization" code is mostly removed, since the synchronization 
is now NN-led. If we feel strongly that the JNs should synchronize "old" log 
segments, instead of just ensuring that every segment is stored on a quorum, we 
can bring some of this back. But we're already guaranteed that every segment 
has two replicas by this design.
- new JNStorage class, since the JN doesn't actually share most of its storage 
code with the NameNode (eg we have no images, no checkpoints, no distributed 
upgrade coordination, etc
- Journal updated to use above JNStorage class instead of NNStorage and 
FSEditLog
- Not attempting to share the RPC protocol with BackupNode or NameNode, since 
we need quorum-specific information like - Not attempting to share the RPC 
protocol with BackupNode or NameNode, since we need quorum-specific information 
like epoch numbers in every RPC, and those don't make sense in the other 
contexts.

                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3077-partial.txt, hdfs-3077.txt, 
> qjournal-design.pdf, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on 
> shared storage such as an NFS filer for the shared edit log. One alternative 
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
> which provides a highly available replicated edit log on commodity hardware. 
> This JIRA is to implement another alternative, based on a quorum commit 
> protocol, integrated more tightly in HDFS and with the requirements driven 
> only by HDFS's needs rather than more generic use cases. More details to 
> follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs

Reply via email to