[
https://issues.apache.org/jira/browse/HDFS-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245112#comment-13245112
]
Flavio Junqueira commented on HDFS-3092:
----------------------------------------
Hi Suresh, Thanks for sharing a design document. I have a few comments and
questions if you don't mind:
# I find this design to be very close to bookkeeper, with a few important
differences. One noticeable difference that has been mentioned elsewhere is
that bookies implement mechanisms to enable high performance when there are
multiple concurrent ledgers being written to. Your design does not seem to
consider the possibility of multiple concurrent logs, which you may want to
have for federation. Federation will be useful for large deployments, but not
for small deployments. It sounds like a good idea to have a solution that
covers both cases.
# There has been comments about comparing the different approaches discussed,
and I was wondering what criteria you have been thinking of using to compare
them. I guess it can't be performance because as the numbers Ivan has generated
show, the current bottleneck is the namenode code, not the logging. Until the
existing bottlenecks in the namenode code are removed, having a fast logging
mechanism won't make much difference with respect to throughput.
# I was wondering about how reads to the log are executed if writes only have
to reach a majority quorum. Once it is time to read, how does the reader gets a
consistent view of the log? One JD alone may not have all entries, so I suppose
the reader may need to read from multiple JDs to get a consistent view? Do the
transaction identifiers establish the order of entries in the log? One quick
note is that I don't see why a majority is required; bk does not require a
majority.
Here are some notes I took comparing the bk approach with the one in this jira,
in the case you're interested:
# *Rolling*: The notion of rolling here is equivalent to closing a ledger and
creating a new one. As ledgers are identified with numbers that are
monotonically increasing, the ledger identifiers can be used to order the
sequence of logs created over time.
# *Single writer*: Only one client can add new entries to a ledger. We have the
notion of a recovery client, which is essentially a reader that makes sure that
there is agreement on the end of the ledger. Such a recovery client is also
able to write entries, but these writes are simply to make sure that there is
enough replication.
# *Fencing*: We fence ledgers individually, so that we guarantee that all
operations a ledger writer returns successfully are persisted on enough
bookies. This is different from the approach proposed here, which essentially
fences logging as a whole.
# *Split brain*: In a split-brain situation, bk can have two writers each
writing to a different ledger. However, my understanding is that a namenode
that is failing over cannot make progress without reading the previous log
(ledger), consequently this situation cannot occur with bk and we don’t require
writes to a majority for correctness.
# *Adding JDs*: The mechanism described here mentions explicitly adding a new
JD. My understanding is that a new JD is brought up and it is told somehow to
connect to the namenode and to another JD in the JournalList to sync up. bk
currently only picks bookies from a pool of available bookies through
zookeeper. It shouldn’t be a problem to allow a fixed list of bookies to be
passed upon creating a ledger.
# *Striping*: bk implements striping, although that’s an optional feature. It
is possible to use a configuration like 2-2 or 3-3 (Q-N, Q=quorum size and
N=ensemble size).
# *Failure detection*: bk uses zookeeper ephemeral nodes to track bookies that
are available. A client also changes its ensemble view if it loses a bookie by
adding a new bookie. I’m not exactly sure how you monitor crashes here. Is it
the namenode that keeps track of which JDs in the JournalList are available?
> Enable journal protocol based editlog streaming for standby namenode
> --------------------------------------------------------------------
>
> Key: HDFS-3092
> URL: https://issues.apache.org/jira/browse/HDFS-3092
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ha, name-node
> Affects Versions: 0.24.0, 0.23.3
> Reporter: Suresh Srinivas
> Assignee: Suresh Srinivas
> Attachments: MultipleSharedJournals.pdf
>
>
> Currently standby namenode relies on reading shared editlogs to stay current
> with the active namenode, for namespace changes. BackupNode used streaming
> edits from active namenode for doing the same. This jira is to explore using
> journal protocol based editlog streams for the standby namenode. A daemon in
> standby will get the editlogs from the active and write it to local edits. To
> begin with, the existing standby mechanism of reading from a file, will
> continue to be used, instead of from shared edits, from the local edits.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira