[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473384#comment-13473384
]
Todd Lipcon commented on HDFS-3077:
-----------------------------------
bq. I did not understand this well. Why are we retrying any request to
JournalNodes? Given most of the requests are not idempotent and cannot be
retried why is this an advantage?
Currently we don't retry most requests, but it would actually be easy to
support retries in most cases, because the client always sends a unique <epoch,
ipc serial number> in each call. If the server receives the same epoch and
serial number twice in a row, it can safely re-respond with the previous
response. This is not true for the newEpoch calls, because this is where we
_enforce_ the unique epoch ID.
As for _why_ we'd want to retry, it seems useful to be able to do so after a
small network blip between NN and JNs, for example.
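To make the retry idea concrete, here is a minimal sketch, purely illustrative (the class and method names are made up; this is not the actual JournalNode code), of how a server could re-answer an exact retry by remembering the last <epoch, ipc serial number> it responded to:
{code:java}
// Illustrative only -- not the real JournalNode implementation.
// The server caches the response it sent for the last <epoch, ipc serial
// number> pair; if it sees the same pair twice in a row, it can safely
// re-respond with the cached result instead of re-executing the call.
// newEpoch is deliberately excluded, since that is the call where the
// uniqueness of the epoch ID is enforced.
public class LastResponseCache {
  private long lastEpoch = -1;
  private long lastSerial = -1;
  private Object lastResponse;

  /** Returns the previous response if this is an exact retry, else null. */
  public synchronized Object lookup(long epoch, long serial) {
    if (epoch == lastEpoch && serial == lastSerial) {
      return lastResponse;
    }
    return null;
  }

  /** Remember the response about to be sent, so a retry can be re-answered. */
  public synchronized void record(long epoch, long serial, Object response) {
    lastEpoch = epoch;
    lastSerial = serial;
    lastResponse = response;
  }
}
{code}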
{quote}
Recover all transactions. We do this in the same fashion as ZAB rather than the
way you suggested "the NN can run the recovery process for each of these
earlier segments individually". Note this requires two changes:
- The protocol message contents change - the response to phase 1 is highest
txid and not highest segment's txid. The JN recovers all previous transactions
from another JN.
- When a JN joins an existing writer it first syncs previous segments.
{quote}
I'll agree to work on the improvement where we recover previous segments, but I
disagree that it should be done as part of the recovery phase. Here are my
reasons:
- Currently with local journals, we don't do this. When an edits directory
crashes and becomes available again, we start writing to it without first
re-copying all previous segments back into it. This has worked fine for us. So
I don't think we need a stronger guarantee on the JN.
- The NN may maintain a few GB worth of edits due to various retention
policies. If a JN crashes and is reformatted, then this would imply that the JN
has to copy multiple GB worth of data from another JN before it can actively
start participating as a destination for new logs. This will take quite some
time.
- Furthermore, because the JN will be syncing its logs from another JN, we need
to make sure the copying is throttled. Otherwise, the restart of a JN will suck
up disk and network bandwidth from the other JN which is trying to provide low
latency logging for the active namenode. If we didn't throttle it, the transfer
and disk IO would balloon the latency for the namespace significantly, which I
think is best to avoid. If we do throttle it (say to 10MB/sec), then syncing
several GB of logs will take several minutes (e.g. 3GB at 10MB/sec is roughly
five minutes), during which time the fault tolerance of the cluster is
compromised.
- Similar to the above, if there are several GB of logs to synchronize, this
will impact NN startup (or failover) time a lot.
I think, instead, the synchronization should be done as a background thread (a
rough sketch follows the list below):
- The thread periodically wakes up and scans for any in-progress segments or
gaps in the local directory
- If it finds one (and it is not the highest numbered segment), then it starts
the synchronization process.
-- We reuse existing code to RPC to the other nodes and find one which has the
finalized segment, and copy it using the transfer throttler to avoid impacting
latency of either the sending or receiving node.
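Here is a rough sketch of what such a background thread might look like. SegmentSyncRunner, JournalStorage, and PeerJournal are hypothetical placeholder names, not existing HDFS classes, and the crude rate limiting below merely stands in for the existing transfer throttler mentioned above:
{code:java}
// Rough sketch only -- placeholder types, not existing HDFS classes.
// A background thread that periodically scans the local journal directory
// for missing or in-progress segments (other than the highest one) and
// copies a finalized replacement from a peer JN at a limited rate.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

public class SegmentSyncRunner implements Runnable {
  private static final long SCAN_INTERVAL_MS = 60_000;      // wake up once a minute
  private static final long MAX_BYTES_PER_SEC = 10L << 20;  // ~10 MB/sec throttle

  /** Placeholder for the local journal storage directory. */
  interface JournalStorage {
    List<Long> missingOrInProgressSegments() throws IOException;
    long highestSegmentTxId() throws IOException;
    OutputStream createTempSegment(long startTxId) throws IOException;
    void finalizeTempSegment(long startTxId) throws IOException;
  }

  /** Placeholder for an RPC/HTTP proxy to another JournalNode. */
  interface PeerJournal {
    boolean hasFinalizedSegment(long startTxId) throws IOException;
    InputStream openSegment(long startTxId) throws IOException;
  }

  private final JournalStorage storage;
  private final List<PeerJournal> peers;
  private volatile boolean running = true;

  public SegmentSyncRunner(JournalStorage storage, List<PeerJournal> peers) {
    this.storage = storage;
    this.peers = peers;
  }

  public void stop() { running = false; }

  @Override
  public void run() {
    while (running) {
      try {
        for (long startTxId : storage.missingOrInProgressSegments()) {
          // Skip the highest-numbered segment: it may still be actively written.
          if (startTxId == storage.highestSegmentTxId()) {
            continue;
          }
          syncSegmentFromPeer(startTxId);
        }
        Thread.sleep(SCAN_INTERVAL_MS);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      } catch (IOException ioe) {
        // Log and try again on the next scan.
      }
    }
  }

  /** Copy a finalized segment from the first peer that has it, rate-limited. */
  private void syncSegmentFromPeer(long startTxId)
      throws IOException, InterruptedException {
    for (PeerJournal peer : peers) {
      if (!peer.hasFinalizedSegment(startTxId)) {
        continue;
      }
      try (InputStream in = peer.openSegment(startTxId);
           OutputStream out = storage.createTempSegment(startTxId)) {
        byte[] buf = new byte[64 * 1024];
        long startNanos = System.nanoTime();
        long totalBytes = 0;
        int n;
        while ((n = in.read(buf)) > 0) {
          out.write(buf, 0, n);
          totalBytes += n;
          // Crude throttle: if we are ahead of the target rate, sleep to fall back.
          long expectedMs = totalBytes * 1000 / MAX_BYTES_PER_SEC;
          long actualMs = (System.nanoTime() - startNanos) / 1_000_000;
          if (expectedMs > actualMs) {
            Thread.sleep(expectedMs - actualMs);
          }
        }
      }
      storage.finalizeTempSegment(startTxId);
      return;
    }
    // No peer had a finalized copy yet; try again on a later scan.
  }
}
{code}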
bq. Let's merge the newEpoch and prepareRecovery. Given that this works for ZAB
I still fail to see why it cannot work for us. I think because of (1), merging
the two steps will no longer be an issue.
I still don't understand why this is _better_ and not just _different_. If you
and Suresh want to make the change, it's OK by me, but I expect that you will
re-run the same validation before committing (e.g. run the randomized fault test
a few hundred thousand times). This testing found a bunch of errors in the
design before, so any change to the design should go through the same test
regimen to make sure we aren't missing some subtlety.
If the above sounds good to you, let's file the follow-up JIRAs and merge?
Thanks.
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Fix For: QuorumJournalManager (HDFS-3077)
>
> Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt,
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf,
> qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf,
> qjournal-design.pdf, qjournal-design.tex, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.