[ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473384#comment-13473384 ]

Todd Lipcon commented on HDFS-3077:
-----------------------------------

bq. I did not understand this well. Why are we retrying any request to 
JournalNodes? Given most of the requests are not idempotent and cannot be 
retried why is this an advantage?

Currently we don't retry most requests, but it would actually be easy to 
support retries in most cases, because the client always sends a unique <epoch, 
ipc serial number> in each call. If the server receives the same epoch and 
serial number twice in a row, it can safely re-respond with the previous 
response. This is not true for the newEpoch calls, because this is where we 
_enforce_ the unique epoch ID.

As for _why_ we'd want to retry, it seems useful to be able to do so after a 
small network blip between NN and JNs, for example.
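
To make that concrete, here's a rough sketch of the kind of per-writer de-duplication the <epoch, ipc serial number> pair enables on the JN side. The class and method names below are purely illustrative, not code in the patch:

{code:java}
import java.util.concurrent.Callable;

// Illustrative only: not the actual JournalNode classes.
class JournalCallDeduper {
  private long lastEpoch = -1;
  private long lastIpcSerial = -1;
  private Object lastResponse;

  synchronized Object process(long epoch, long ipcSerial, Callable<Object> op)
      throws Exception {
    if (epoch == lastEpoch && ipcSerial == lastIpcSerial) {
      // Retried call: safely re-respond with the cached previous response.
      return lastResponse;
    }
    // New call: execute it and remember the response in case of a retry.
    Object response = op.call();
    lastEpoch = epoch;
    lastIpcSerial = ipcSerial;
    lastResponse = response;
    return response;
  }
}
{code}

newEpoch itself couldn't go through a path like this, since, as noted above, it is exactly the call where we enforce epoch uniqueness.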

{quote}
Recover all transactions. We do this in the same fashion as ZAB rather than the 
way you suggested "the NN can run the recovery process for each of these 
earlier segments individually". Note this requires two changes:
- The protocol message contents change - the response to phase 1 is highest 
txid and not highest segment's txid. The JN recovers all previous transactions 
from another JN.
- When a JN joins an existing writer it first syncs previous segments.
{quote}

I'll agree to work on the improvement where we recover previous segments, but I 
disagree that it should be done as part of the recovery phase. Here are my 
reasons:
- Currently with local journals, we don't do this. When an edits directory 
crashes and becomes available again, we start writing to it without first 
re-copying all previous segments back into it. This has worked fine for us. So 
I don't think we need a stronger guarantee on the JN.
- The NN may maintain a few GB worth of edits due to various retention 
policies. If a JN crashes and is reformatted, then this would imply that the JN 
has to copy multiple GB worth of data from another JN before it can actively 
start participating as a destination for new logs. This will take quite some 
time.
- Furthermore, because the JN will be syncing its logs from another JN, we need 
to make sure the copying is throttled. Otherwise, the restart of a JN will suck 
up disk and network bandwidth from the other JN, which is trying to provide 
low-latency logging for the active namenode. If we didn't throttle it, the 
transfer and disk IO would balloon namespace latency significantly, which I 
think is best to avoid. If we do throttle it (say to 10MB/sec), then syncing 
several GB of logs will take several minutes, during which time the fault 
tolerance of the cluster is compromised.
- Similar to the above, if there are several GB of logs to synchronize, this 
will impact NN startup (or failover) time a lot.
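
(To put rough numbers on the throttled sync above: at 10MB/sec, re-copying 3GB of retained edits takes about 5 minutes, and proportionally longer for larger retention settings.)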

I think, instead, the synchronization should be done as a background thread:
- The thread periodically wakes up and scans for any in-progress segments or 
gaps in the local directory
- If it finds one (and it is not the highest numbered segment), then it starts 
the synchronization process.
-- We reuse existing code to RPC to the other nodes, find one that has the 
finalized segment, and copy it using the transfer throttler to avoid impacting 
latency on either the sending or receiving node.
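
Something along these lines (a sketch only -- the Segment/JournalNodeAddr types and helper methods are placeholders, not existing APIs):

{code:java}
// Sketch of the proposed background sync thread; types and helper methods
// are placeholders for the real JN storage and IPC hooks.
class SegmentSyncThread extends Thread {
  private static final long SCAN_INTERVAL_MS = 60 * 1000L;

  interface Segment {
    boolean isInProgressOrMissing();
    boolean isHighestNumbered();
  }
  interface JournalNodeAddr {}

  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        Thread.sleep(SCAN_INTERVAL_MS);
        for (Segment seg : scanLocalSegments()) {
          // Leave the highest-numbered segment alone: the writer may still
          // be appending to it.
          if (seg.isInProgressOrMissing() && !seg.isHighestNumbered()) {
            // Reuse the existing RPC code to find a node that has the
            // finalized copy, then pull it through the transfer throttler
            // so the sync doesn't hurt logging latency on either side.
            JournalNodeAddr source = findNodeWithFinalizedSegment(seg);
            copyWithThrottle(source, seg);
          }
        }
      } catch (InterruptedException ie) {
        return;
      } catch (Exception e) {
        // Log and try again on the next scan.
      }
    }
  }

  // Placeholder hooks into the JN storage and IPC layers.
  private java.util.List<Segment> scanLocalSegments() {
    return java.util.Collections.emptyList();
  }
  private JournalNodeAddr findNodeWithFinalizedSegment(Segment seg) {
    return null;
  }
  private void copyWithThrottle(JournalNodeAddr source, Segment seg) {
  }
}
{code}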

bq. Lets merge the newEpoch and prepareRecovery. Given that this works for ZAB 
I still fail to see why it cannot work for us. I think because of (1), merging 
the two steps will no longer be an issue.

I still don't understand why this is _better_ and not just _different_. If you 
and Suresh want to make the change, it's OK by me, but I expect that you will 
re-run the same validation before committing (e.g. run the randomized fault test 
a few hundred thousand times). This testing found a bunch of errors in the 
design before, so any change to the design should go through the same test 
regimen to make sure we aren't missing some subtlety.


If the above sounds good to you, let's file the follow-up JIRAs and merge? 
Thanks.
                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, 
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, 
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.pdf, qjournal-design.tex, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on 
> shared storage such as an NFS filer for the shared edit log. One alternative 
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
> which provides a highly available replicated edit log on commodity hardware. 
> This JIRA is to implement another alternative, based on a quorum commit 
> protocol, integrated more tightly in HDFS and with the requirements driven 
> only by HDFS's needs rather than more generic use cases. More details to 
> follow.
