[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399757#comment-13399757
]
Todd Lipcon commented on HDFS-3077:
-----------------------------------
bq. if the JournalNode that already received a prepare RPC with a higher newEpoch
number (possible?) can inform the writer (proposer), the writer can exit earlier
in step 2 "Choosing a recovery".
Yes, it would reject such RPCs. Once it receives newEpoch(N), it won't accept
any RPCs from any writer with a lower epoch number (the normal Paxos promise
guarantee).
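For illustration, here is a minimal sketch of that promise check. The class and
method names are hypothetical, not the actual JournalNode code:
{code:java}
import java.io.IOException;

// Sketch only: tracks the highest epoch this JournalNode has promised to,
// and rejects any RPC from a writer carrying a stale (lower) epoch.
class EpochState {
  private long promisedEpoch = 0;

  // Handles newEpoch(N): refuse anything not strictly newer than the current promise.
  synchronized void newEpoch(long epoch) throws IOException {
    if (epoch <= promisedEpoch) {
      throw new IOException("Epoch " + epoch + " is not newer than promised epoch "
          + promisedEpoch);
    }
    promisedEpoch = epoch;
  }

  // Called on every subsequent writer RPC (journal, startLogSegment, accept, ...).
  synchronized void checkWriterEpoch(long writerEpoch) throws IOException {
    if (writerEpoch < promisedEpoch) {
      throw new IOException("Rejecting RPC from writer with stale epoch " + writerEpoch
          + "; promised epoch is " + promisedEpoch);
    }
  }
}
{code}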
bq. In step 3 "Accept RPC", I assume the URL that the writer sends to all the
JNs is the URL of one JN which responded in step 2. If that JN becomes
inaccessible immediately, so that other JNs can't sync themselves by
downloading the finalized segment from it, could the recovery process get
stuck?
More or less, yes. A potential improvement would be to send a list of URLs
here, including every JN that already has the correct segment in sync.
IMO, recovery needs to be tolerant of missing nodes, but I don't think we need
it to be tolerant of nodes crashing in the middle of the recovery process --
it's OK for the NN to bail out of startup in that case, so long as it doesn't
leave the system in an unrecoverable state. The "retry" would be done by trying
again to start the NN. I'll try to address this in the next rev of the
document. It would be future work to add a retry loop around the recovery
process so that it tolerates such crashes.
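If we did add that retry loop, it might look roughly like the sketch below. This
is only an illustration of the idea; the method names are made up, and the
recovery call stands in for the whole prepare/accept/finalize exchange with a
quorum of JNs:
{code:java}
import java.io.IOException;

// Hypothetical outline of a retry loop around segment recovery.
abstract class RecoveryRunner {
  // Assumed helper: runs one full recovery attempt against a quorum of JournalNodes.
  abstract void runSegmentRecovery() throws IOException;

  void recoverWithRetries(int maxAttempts) throws IOException, InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        runSegmentRecovery();
        return;  // recovery finished; NN startup can proceed
      } catch (IOException e) {
        if (attempt >= maxAttempts) {
          throw e;  // give up; the NN bails out of startup, as it does today
        }
        Thread.sleep(1000L * attempt);  // back off and retry with a fresh quorum
      }
    }
  }
}
{code}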
My thinking behind treating mid-recovery crashes as a non-issue is that log
segments tend to be short -- in an HA setup we roll every 2 minutes. So even in
a heavily loaded cluster, segments tend to be quite small. I just looked on a
100-node QA cluster here running HA and a constant workload of terasort,
gridmix, etc, and the largest edit log segment is 3.1MB. So the recovery process
should complete in far less than 1 second, given that the transfer time for such
a segment would be <~50ms. Hence the chances of a crash during this timeframe
are vanishingly small.
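A quick back-of-the-envelope check of that figure (the throughput number below
is an assumption, roughly what a well-behaved gigabit link sustains, not a
measurement):
{code:java}
// Rough estimate of the transfer time for the largest observed segment.
public class SegmentTransferEstimate {
  public static void main(String[] args) {
    double segmentMB = 3.1;     // largest edit log segment seen on the QA cluster
    double mbPerSec  = 100.0;   // assumed effective network throughput
    double transferMs = segmentMB / mbPerSec * 1000.0;
    System.out.printf("Estimated transfer time: %.0f ms%n", transferMs);  // ~31 ms
  }
}
{code}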
bq. If it could be stuck in step 3, an alternate way to sync lagging JNs is to
let them contact the other JNs in the quorum to download the finalized segments.
Given the requirement "All loggers must finalize the segment to the same length
and contents", all finalized segments with the same name should be identical
across JNs. Therefore, a lagging JN can download the segment from any other JN
that has the file.
Yep, per the above I think we can extend this to a list of all JNs which are
known to be up to date with the sync txid. But I think that's a good future
enhancement rather than a requirement for right now. Does my reasoning make
sense?
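To make the idea concrete, a lagging JN given such a list could simply try each
candidate in turn. This is just a sketch under the assumptions above; the names
and the download helper are illustrative, not existing HDFS APIs:
{code:java}
import java.io.IOException;
import java.net.URL;
import java.util.List;

// Hypothetical fallback: try every JN known to hold the finalized segment
// until one download succeeds. Any copy works, since finalized segments
// with the same name are byte-identical across JNs.
abstract class SegmentSyncer {
  // Assumed helper: fetch the finalized segment [firstTxId, lastTxId] from one JN.
  abstract void downloadFinalizedSegment(URL jn, long firstTxId, long lastTxId)
      throws IOException;

  void syncSegment(List<URL> candidateJns, long firstTxId, long lastTxId)
      throws IOException {
    IOException lastFailure = null;
    for (URL jn : candidateJns) {
      try {
        downloadFinalizedSegment(jn, firstTxId, lastTxId);
        return;  // got a good copy
      } catch (IOException e) {
        lastFailure = e;  // this JN is unreachable; fall through to the next one
      }
    }
    throw lastFailure != null ? lastFailure
        : new IOException("no candidate JournalNodes provided");
  }
}
{code}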
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hdfs-3077-partial.txt, qjournal-design.pdf,
> qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.