[
https://issues.apache.org/jira/browse/HDFS-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021017#comment-14021017
]
Konstantin Shvachko commented on HDFS-6469:
-------------------------------------------
Suresh, your questions are right to the point and call for a detailed response.
h4. > adding all this complexity into HDFS
Well we added all the other complexity into HDFS already...
In fact it is really simple:
# Instead of applying write requests to the NameNode directly we first send
them as proposals for coordination
# Once the order of the update is agreed upon we apply it to the NameNode
# Then we make sure the updates on NN are deterministic.
The rest of the design document is about how we don't need to do anything since
it is already deterministic, like in sequential block id generation, or how to
change current behavior to make it deterministic, like in lease recovery.
All the complexity is "hidden" in the distributed coordination engine.
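To make the three steps concrete, here is a minimal sketch in Java. All names in it (CoordinatedUpdateSketch, Proposal, Replica, ToyEngine) are hypothetical and are not the API from the design document. The "engine" is just an in-process sequencer standing in for a real consensus protocol. It only shows that replicas applying the same agreed sequence with deterministic updates end up in the same state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: these class and method names are hypothetical,
// not the API proposed in HDFS-6469.
final class CoordinatedUpdateSketch {

    /** A namespace write request, submitted as a proposal instead of being applied directly. */
    static final class Proposal {
        final String op;    // "mkdir" or "delete" in this toy example
        final String path;
        Proposal(String op, String path) { this.op = op; this.path = path; }
    }

    /** One replicated node: applies agreements strictly in the agreed order. */
    static final class Replica {
        private final SortedSet<String> namespace = new TreeSet<>();
        private long lastApplied = 0;

        void apply(long seq, Proposal p) {
            if (seq != lastApplied + 1) {
                throw new IllegalStateException("agreement out of order: " + seq);
            }
            // Deterministic update: depends only on current state and the proposal.
            if (p.op.equals("mkdir")) {
                namespace.add(p.path);
            } else if (p.op.equals("delete")) {
                namespace.remove(p.path);
            } else {
                throw new IllegalArgumentException("unknown op: " + p.op);
            }
            lastApplied = seq;
        }

        SortedSet<String> snapshot() { return new TreeSet<>(namespace); }
    }

    /** Toy stand-in for the coordination engine: a single in-process sequencer. */
    static final class ToyEngine {
        private final List<Replica> replicas = new ArrayList<>();
        private long nextSeq = 1;

        void register(Replica r) { replicas.add(r); }

        /** Fixes the position of the proposal and delivers it to every replica in that order. */
        synchronized long propose(Proposal p) {
            long seq = nextSeq++;
            for (Replica r : replicas) {
                r.apply(seq, p);
            }
            return seq;
        }
    }
}
```

A real engine would reach agreement among several nodes and deliver agreements asynchronously; the sketch keeps only the ordering guarantee.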
h4. > advantages
The immediate advantages for HDFS are
* active-active HA, which means simplicity: no failover and no fencing.
* load balancing between replicated NameNodes, which is especially effective
for read intensive workloads
Longer-term we will be able to introduce
* cross-datacenter support for DR,
* distributed transactions,
* distributed namespaces
h4. > and disadvantages (slowdown due to consensus) of this solution.
I don't know of any disadvantages. But your question is probably about the
performance.
I am not in a position to publish exact numbers. When we have
ZKCoordinationEngine ready we can run benchmarks. I can say the following:
# I don't see any degradation in running DFSIO or teragen (write workload) with
single NN vs. three coordinated CNodes.
# There is no difference in running NNThroughputBenchmark on a single CNode
(without coordination) vs. a standard NameNode. There is no coordination in
this case because NNThroughputBenchmark by design bypasses RPC; coordination
is tied to the RPC layer and is therefore bypassed as well.
# A new microbenchmark is needed to capture the potential overhead of coordination.
Conceptually I do not see degradation in throughput of CNode write operations.
We should expect degradation in latency of individual operations, but there are
different known ways to optimize it.
That said, I do not rule out that initially the performance of CNode may be
bad, but optimizations will be developed, similar to how we optimized the
current NameNode.
h4. > what happens when a change cannot be journaled after consensus is reached
and memory is updated.
The short answer is that it is the same as standard NN.
Journals are local to each CNode. If a journal stream fails it is excluded
from further journaling. When a CNode runs out of journal streams it fails and
is brought down, no different from any other failure. The rest of the CNodes
continue normal operations as long as they form a quorum.
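The journal-failure rule can be sketched as follows. This is an illustration under my own naming (JournalSetSketch, JournalStream), not the NameNode's actual journaling classes: a failed stream is dropped from the active set, and the node gives up only when no stream is left.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: hypothetical names, not the real HDFS journaling classes.
final class JournalSetSketch {

    /** A local journal stream that may fail on write. */
    interface JournalStream {
        void write(String edit) throws IOException;
    }

    private final List<JournalStream> active = new ArrayList<>();

    JournalSetSketch(List<JournalStream> streams) {
        active.addAll(streams);
    }

    /**
     * Writes the edit to every active stream. A stream that fails is excluded
     * from further journaling. If no stream remains, the node cannot journal
     * any more and must be brought down, which is signalled by the exception.
     */
    void log(String edit) {
        Iterator<JournalStream> it = active.iterator();
        while (it.hasNext()) {
            JournalStream stream = it.next();
            try {
                stream.write(edit);
            } catch (IOException e) {
                it.remove();
            }
        }
        if (active.isEmpty()) {
            throw new IllegalStateException("no journal streams left; node must shut down");
        }
    }

    int activeCount() {
        return active.size();
    }
}
```

The failure stays local to the one node: the other nodes keep their own journals and are not affected by this exception.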
An interesting aspect here is that edits are optional for CNodes and can be
completely eliminated for performance reasons. Recent transactions are stored
in the CE anyway. And as long as at least one CNode is alive, it is the source
of the current namespace state.
h4. > How are requests from other clients handled when a request is being
processed?
Clients can submit proposals simultaneously, then they wait until the agreement
is executed. Same as RPC.
h4. > sighted as bottleneck currently, the namenode single lock, much worser?
Whether the single lock is a bottleneck of the current NN is a different
discussion. The last time I investigated this it was not. The bottleneck was
the journaling I/O. Also, it was noticed that the real cluster demand was an
order of magnitude lower than the NN capacity.
CNodes do not make it worse. We used the same optimization of batching edits
and syncing them together before replying to the client, so that agreements
can be executed without waiting for syncs.
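A rough sketch of the batching idea, assuming a single in-memory buffer and hypothetical names (BatchedSyncSketch and its methods are for illustration, not the actual edit log code): executing an agreement only appends to the buffer, and one sync, issued before the reply to a client, makes every buffered edit durable at once.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names; "durable" is a list standing in for disk.
final class BatchedSyncSketch {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> durable = new ArrayList<>();
    private long lastWrittenTxId = 0;
    private long lastSyncedTxId = 0;
    private int syncCalls = 0;

    /** Called while executing an agreement: buffers the edit, does no I/O. */
    synchronized long logEdit(String edit) {
        buffer.add(edit);
        return ++lastWrittenTxId;
    }

    /**
     * Called before replying to the client that issued txId. If an earlier
     * sync already covered txId, there is nothing to do. Otherwise the whole
     * buffer is flushed in one batch.
     */
    synchronized void logSync(long txId) {
        if (txId <= lastSyncedTxId) {
            return;
        }
        durable.addAll(buffer);
        buffer.clear();
        lastSyncedTxId = lastWrittenTxId;
        syncCalls++;
    }

    synchronized int syncCalls() { return syncCalls; }

    synchronized int durableCount() { return durable.size(); }
}
```

With three edits buffered before the first reply is due, the first logSync flushes all three and the next two calls return immediately, so the cost of one sync is shared by the batch.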
h4. > abandon BackupNode based solutions?
Not sure which _solutions_, but from my perspective there will be no need to
keep the BackupNode class once CNode is in.
> Coordinated replication of the namespace using ConsensusNode
> ------------------------------------------------------------
>
> Key: HDFS-6469
> URL: https://issues.apache.org/jira/browse/HDFS-6469
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: namenode
> Affects Versions: 3.0.0
> Reporter: Konstantin Shvachko
> Assignee: Konstantin Shvachko
> Attachments: CNodeDesign.pdf
>
>
> This is a proposal to introduce ConsensusNode - an evolution of the NameNode,
> which enables replication of the namespace on multiple nodes of an HDFS
> cluster by means of a Coordination Engine.