[ 
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230629#comment-13230629
 ] 

Suresh Srinivas commented on HDFS-3077:
---------------------------------------

bq. but like Einstein said, no simpler!
It's all relative :-)

BTW it would be good to write a design doc for this. That avoids lengthy 
comments and keeps the summary of what is proposed in one place, instead of 
scattering it across multiple comments.

bq. This is mostly great – so long as you have an external fencing strategy 
which prevents the old active from attempting to continue to write after the 
new active is trying to read. 
External fencing is not needed, given that the active daemons have the ability 
to fence.

bq. it gets the loggers to promise not to accept edits from the old active
The daemons can stop accepting writes when they realize that the active lock is 
no longer held by the writer. This is a clear advantage of an active daemon 
compared to using passive storage.
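
To make that concrete, here is a rough sketch of the fencing behavior (the 
class and method names below are illustrative only, not actual HDFS code): each 
journal daemon promises an epoch to the most recent writer and rejects journal 
calls from any writer with an older epoch.

{code}
import java.io.IOException;

// Illustrative sketch only: a journal daemon fences a stale active by
// rejecting writes whose epoch is older than the newest epoch it has
// promised to honor.
public class JournalDaemon {
  private long promisedEpoch = 0;   // highest epoch this daemon has accepted

  // Called by a writer (the new active) when it takes over; the daemon
  // promises to reject any writer with a smaller epoch from now on.
  public synchronized void newEpoch(long epoch) throws IOException {
    if (epoch <= promisedEpoch) {
      throw new IOException("Epoch " + epoch + " is not newer than "
          + promisedEpoch);
    }
    promisedEpoch = epoch;
  }

  // Called on every journal RPC; a writer with a stale epoch is fenced here.
  public synchronized void journal(long epoch, long firstTxId, byte[] records)
      throws IOException {
    if (epoch < promisedEpoch) {
      throw new IOException("Writer with stale epoch " + epoch
          + " rejected; promised epoch is " + promisedEpoch);
    }
    // ... append records to the current edit log segment ...
  }
}
{code}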

bq. But, we still have one more problem: given some txid N, we might have 
multiple actives that have tried to write the same transaction ID. Example 
scenario:
The case of writes making it through only some of the daemons can also be 
solved. The writes that have made it through W daemons win. The others are 
marked out of sync and need to sync up. Explanation to follow.
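
As a rough sketch of that W-quorum rule (again, the names are illustrative, not 
actual code): a batch of edits counts as committed only once W of the N journal 
daemons have acknowledged it, and the daemons that missed it are flagged for 
re-sync.

{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: quorum bookkeeping for a single batch of edits.
public class QuorumTracker {
  private final int writeQuorum;    // W, e.g. a majority of the N daemons

  public QuorumTracker(int numDaemons, int writeQuorum) {
    if (writeQuorum > numDaemons) {
      throw new IllegalArgumentException("W cannot exceed N");
    }
    this.writeQuorum = writeQuorum;
  }

  /** True if the batch is durable, i.e. acknowledged by at least W daemons. */
  public boolean isDurable(boolean[] acks) {
    int acked = 0;
    for (boolean a : acks) {
      if (a) {
        acked++;
      }
    }
    return acked >= writeQuorum;
  }

  /** Indices of daemons that missed the batch and need to sync up. */
  public List<Integer> outOfSyncDaemons(boolean[] acks) {
    List<Integer> lagging = new ArrayList<Integer>();
    for (int i = 0; i < acks.length; i++) {
      if (!acks[i]) {
        lagging.add(i);
      }
    }
    return lagging;
  }
}
{code}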

The solution we are building is specific to namenode editlogs. There is only 
one active writer (as Ivan brought up earlier). Here is the outline I am 
thinking of.

Let's start with the steady state with K of N journal daemons. When a journal 
daemon fails, we roll the edits. When a journal daemon joins, we roll the 
edits. A newly joined journal daemon could start syncing the other finalized 
edits, while keeping track of the edits in progress. We also keep track of the 
list of active daemons in ZooKeeper. Rolling gives a logical point for a newly 
joined daemon to sync up to (sort of like a generation stamp).
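
A rough sketch of the membership tracking and roll-on-change behavior (the 
znode path and the roller hook below are made up for illustration; this is not 
the eventual implementation):

{code}
import java.util.List;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch only: journal daemons register ephemeral znodes under an
// assumed /journal-daemons path, and the writer rolls the edit log whenever
// that membership changes, giving new daemons a clean point to sync from.
public class JournalMembershipWatcher implements Watcher {
  private static final String DAEMONS_PATH = "/journal-daemons"; // assumed

  private final ZooKeeper zk;
  private final EditLogRoller roller;   // hypothetical hook that rolls edits

  public JournalMembershipWatcher(ZooKeeper zk, EditLogRoller roller) {
    this.zk = zk;
    this.roller = roller;
  }

  /** Arm a watch on the set of live journal daemons. */
  public void start() throws Exception {
    zk.getChildren(DAEMONS_PATH, this);
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
      try {
        // Re-arm the watch and read the new membership.
        List<String> live = zk.getChildren(DAEMONS_PATH, this);
        // A daemon joined or failed: roll the edits so the new membership
        // starts at a clean segment boundary (like a generation stamp).
        roller.rollEditLog(live);
      } catch (Exception e) {
        // A real implementation would retry and log properly.
        e.printStackTrace();
      }
    }
  }

  /** Hypothetical callback that rolls the current edit log segment. */
  public interface EditLogRoller {
    void rollEditLog(List<String> liveDaemons);
  }
}
{code}
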
During failover, the new active gets from the actively written journals the 
point up to which it has to sync. It then rolls the edits at that point as 
well. Rolling also gives you a way to discard, during failover, extra journal 
records that made it to fewer than W daemons. When there are overlapping 
records, say e1-105 and e100-200, you read transactions 100-105 from the second 
editlog and discard them from the first editlog.
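
A small sketch of that overlap rule with the e1-105 / e100-200 example 
(illustrative code, not the eventual implementation):

{code}
// Illustrative sketch only: when two edit log segments overlap, the earlier
// one is truncated just before the later one's first txid, so each
// transaction is read exactly once.
public class SegmentOverlapResolver {

  /** Inclusive txid range of an edit log segment. */
  public static class Segment {
    final long firstTxId;
    final long lastTxId;

    Segment(long firstTxId, long lastTxId) {
      this.firstTxId = firstTxId;
      this.lastTxId = lastTxId;
    }
  }

  /** Range to actually read from the earlier segment given a later one. */
  public static Segment truncateEarlier(Segment earlier, Segment later) {
    if (later.firstTxId <= earlier.lastTxId) {
      // Overlap: e.g. earlier = [1,105], later = [100,200] means we read
      // [1,99] from the earlier segment and [100,200] from the later one.
      return new Segment(earlier.firstTxId, later.firstTxId - 1);
    }
    return earlier;   // no overlap, read the whole segment
  }

  public static void main(String[] args) {
    Segment trimmed = truncateEarlier(new Segment(1, 105), new Segment(100, 200));
    System.out.println("Read " + trimmed.firstTxId + "-" + trimmed.lastTxId
        + " from the first editlog, 100-200 from the second.");
  }
}
{code}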

Again, there are scenarios missing here. I plan to post more details in a 
design document for this.
                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> Currently, one of the weak points of the HA design is that it relies on 
> shared storage such as an NFS filer for the shared edit log. One alternative 
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
> which provides a highly available replicated edit log on commodity hardware. 
> This JIRA is to implement another alternative, based on a quorum commit 
> protocol, integrated more tightly in HDFS and with the requirements driven 
> only by HDFS's needs rather than more generic use cases. More details to 
> follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

