[ 
https://issues.apache.org/jira/browse/HDFS-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245112#comment-13245112
 ] 

Flavio Junqueira commented on HDFS-3092:
----------------------------------------

Hi Suresh, Thanks for sharing a design document. I have a few comments and 
questions if you don't mind:

# I find this design to be very close to bookkeeper, with a few important 
differences. One noticeable difference that has been mentioned elsewhere is 
that bookies implement mechanisms to enable high performance when there are 
multiple concurrent ledgers being written to. Your design does not seem to 
consider the possibility of multiple concurrent logs, which you may want to 
have for federation. Federation will be useful for large deployments, but not 
for small deployments. It sounds like a good idea to have a solution that 
covers both cases.  
# There has been comments about comparing the different approaches discussed, 
and I was wondering what criteria you have been thinking of using to compare 
them. I guess it can't be performance because as the numbers Ivan has generated 
show, the current bottleneck is the namenode code, not the logging. Until the 
existing bottlenecks in the namenode code are removed, having a fast logging 
mechanism won't make much difference with respect to throughput. 
# I was wondering about how reads to the log are executed if writes only have 
to reach a majority quorum. Once it is time to read, how does the reader gets a 
consistent view of the log? One JD alone may not have all entries, so I suppose 
the reader may need to read from multiple JDs to get a consistent view? Do the 
transaction identifiers establish the order of entries in the log? One quick 
note is that I don't see why a majority is required; bk does not require a 
majority.

Here are some notes I took comparing the bk approach with the one in this jira, 
in the case you're interested:

# *Rolling*: The notion of rolling here is equivalent to closing a ledger and 
creating a new one. As ledgers are identified with numbers that are 
monotonically increasing, the ledger identifiers can be used to order the 
sequence of logs created over time.
# *Single writer*: Only one client can add new entries to a ledger. We have the 
notion of a recovery client, which is essentially a reader that makes sure that 
there is agreement on the end of the ledger. Such a recovery client is also 
able to write entries, but these writes are simply to make sure that there is 
enough replication.
# *Fencing*: We fence ledgers individually, so that we guarantee that all 
operations a ledger writer returns successfully are persisted on enough 
bookies. This is different from the approach proposed here, which essentially 
fences logging as a whole.   
# *Split brain*: In a split-brain situation, bk can have two writers each 
writing to a different ledger. However, my understanding is that a namenode 
that is failing over cannot make progress without reading the previous log 
(ledger), consequently this situation cannot occur with bk and we don’t require 
writes to a majority for correctness. 
# *Adding JDs*: The mechanism described here mentions explicitly adding a new 
JD. My understanding is that a new JD is brought up and it is told somehow to 
connect to the namenode and to another JD in the JournalList to sync up. bk 
currently only picks bookies from a pool of available bookies through 
zookeeper. It shouldn’t be a problem to allow a fixed list of bookies to be 
passed upon creating a ledger.
# *Striping*: bk implements striping, although that’s an optional feature. It 
is possible to use a configuration like 2-2 or 3-3 (Q-N, Q=quorum size and 
N=ensemble size).  
# *Failure detection*: bk uses zookeeper ephemeral nodes to track bookies that 
are available. A client also changes its ensemble view if it loses a bookie by 
adding a new bookie. I’m not exactly sure how you monitor crashes here. Is it 
the namenode that keeps track of which JDs in the JournalList are available?

   
                
> Enable journal protocol based editlog streaming for standby namenode
> --------------------------------------------------------------------
>
>                 Key: HDFS-3092
>                 URL: https://issues.apache.org/jira/browse/HDFS-3092
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, name-node
>    Affects Versions: 0.24.0, 0.23.3
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>         Attachments: MultipleSharedJournals.pdf
>
>
> Currently standby namenode relies on reading shared editlogs to stay current 
> with the active namenode, for namespace changes. BackupNode used streaming 
> edits from active namenode for doing the same. This jira is to explore using 
> journal protocol based editlog streams for the standby namenode. A daemon in 
> standby will get the editlogs from the active and write it to local edits. To 
> begin with, the existing standby mechanism of reading from a file, will 
> continue to be used, instead of from shared edits, from the local edits.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to