[ https://issues.apache.org/jira/browse/HDFS-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258512#comment-13258512 ]

Todd Lipcon commented on HDFS-3092:
-----------------------------------

Can you clarify a few things in this document?

- In ParallelWritesWithBarrier, what happens to the journals that time out or 
fail? It seems you need to mark them as failed in ZK (or something similar) in 
order to be correct. But if you do that, why does Q need to be a "quorum"? 
Q=1 should suffice for correctness, and Q=2 should suffice in order to always 
be available to recover.

It seems the protocol should be closer to:
1) send out write request to all active JNs
2) wait until all respond, or a configurable timeout
3) any that do not respond are marked as failed in ZK
4) If the remaining number of JNs is sufficient (I'd guess 2), then succeed 
the write. Otherwise fail the write and abort.
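The four steps above could be sketched roughly as follows. This is a hypothetical illustration, not actual HDFS code: the class, the `attemptWrite` method, and the threshold of 2 remaining JNs are all assumptions for the sake of the sketch, with the timeout collapsed into a per-JN boolean and the "mark failed in ZK" step simulated by a plain in-memory set.

```java
import java.util.*;

// Hypothetical sketch of the proposed write path: fan out to all active
// JournalNodes, treat non-responders as failed, record them (in ZK in the
// real design; a plain Set here), and succeed only if enough JNs remain.
public class QuorumWriteSketch {
    // Assumed threshold: the comment guesses 2 remaining JNs suffice.
    static final int MIN_ACTIVE_JOURNALS = 2;

    final Set<String> failedJournals = new HashSet<>();

    // responses: JN name -> whether it acked before the configurable
    // timeout (steps 1 and 2 are collapsed into this boolean).
    boolean attemptWrite(Map<String, Boolean> responses) {
        int remaining = 0;
        for (Map.Entry<String, Boolean> e : responses.entrySet()) {
            if (e.getValue()) {
                remaining++;                    // JN acked in time
            } else {
                failedJournals.add(e.getKey()); // step 3: mark failed "in ZK"
            }
        }
        return remaining >= MIN_ACTIVE_JOURNALS; // step 4: succeed or abort
    }

    public static void main(String[] args) {
        QuorumWriteSketch w = new QuorumWriteSketch();
        Map<String, Boolean> acks = new LinkedHashMap<>();
        acks.put("jn1", true);
        acks.put("jn2", true);
        acks.put("jn3", false);  // timed out
        System.out.println(w.attemptWrite(acks));  // 2 of 3 acked: true
        System.out.println(w.failedJournals);      // [jn3]
    }
}
```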

The recovery protocol here is also a little tricky. I haven't seen a 
description of the specifics, and there are a number of cases to handle. For 
example, even if a write appears to fail from the perspective of the writer, 
it may have actually succeeded. Another situation: what happens if the writer 
crashes between step 2 and step 3, so the JNs have differing numbers of txns 
but ZK indicates they're all up to date?
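To make that crash-window case concrete: a hypothetical illustration of one plausible reconciliation, which is not something the design doc specifies. If ZK lists every JN as healthy but their logs disagree (because the writer died after some JNs applied the txn and before any failure was recorded), recovery could treat the highest txid among reachable JNs as authoritative and re-sync the laggards. The class and method names are invented for this sketch.

```java
import java.util.*;

// Hypothetical: ZK says all JNs are up to date, yet their last txids
// differ because the writer crashed mid-protocol. One plausible (assumed)
// reconciliation: adopt the highest txid seen on any reachable JN, then
// re-sync the JNs that are behind it.
public class RecoverySketch {
    static long reconcile(Map<String, Long> lastTxIdByJournal) {
        // Even a write that "failed" from the writer's view may have
        // landed on some JNs, so recovery must adopt the longest log.
        return Collections.max(lastTxIdByJournal.values());
    }

    public static void main(String[] args) {
        Map<String, Long> txids = Map.of("jn1", 105L, "jn2", 105L, "jn3", 104L);
        System.out.println(reconcile(txids)); // jn3 must be synced up to 105
    }
}
```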


Regarding quorum commits:
bq. b. The journal set is fixed in the config. Hard to add/replace hardware.
There are protocols that could be used to change the quorum size/membership at 
runtime. They do add complexity, though, so I think they should be seen as a 
future improvement rather than discounted as impossible.
Another point is that hardware replacement can easily be treated the same as a 
full crash and loss of disk. If one node completely crashes, a new node could 
be brought in with the same hostname with no complicated protocols.
Adding or removing nodes shouldn't be hard to support during a downtime window, 
which I think satisfies most use cases pretty well.


Regarding bookkeeper:
- other operational concerns aren't mentioned: e.g. it doesn't use Hadoop 
metrics, the same style of configuration files, daemon scripts, etc.
                
> Enable journal protocol based editlog streaming for standby namenode
> --------------------------------------------------------------------
>
>                 Key: HDFS-3092
>                 URL: https://issues.apache.org/jira/browse/HDFS-3092
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, name-node
>    Affects Versions: 0.24.0, 0.23.3
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>         Attachments: ComparisonofApproachesforHAJournals.pdf, 
> MultipleSharedJournals.pdf, MultipleSharedJournals.pdf, 
> MultipleSharedJournals.pdf
>
>
> Currently the standby namenode relies on reading the shared editlogs to stay 
> current with the active namenode for namespace changes. BackupNode used 
> streaming of edits from the active namenode for the same purpose. This jira 
> is to explore using journal protocol based editlog streams for the standby 
> namenode. A daemon in the standby will get the editlogs from the active and 
> write them to local edits. To begin with, the existing standby mechanism of 
> reading from a file will continue to be used, but reading from the local 
> edits instead of the shared edits.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
