[ 
https://issues.apache.org/jira/browse/HDFS-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021017#comment-14021017
 ] 

Konstantin Shvachko commented on HDFS-6469:
-------------------------------------------

Suresh, your questions are right to the point and call for a detailed response.

h4. >  adding all this complexity into HDFS

Well we added all the other complexity into HDFS already...
In fact it is really simple:
# Instead of applying write requests to the NameNode directly we first send 
them as proposals for coordination
# Once the order of the update is agreed upon we apply it to the NameNode
# Then we make sure the updates on NN are deterministic.

The rest of the design document is about how we don't need to do anything since 
it is already deterministic, like in sequential block id generation, or how to 
change current behavior to make it deterministic, like in lease recovery.
All the complexity is "hidden" in the distributed coordination engine.

h4. > advantages
The immediate advantages for HDFS are
* active-active HA, which means simplicity: no failover and no fencing.
* load balancing between replicated NameNodes, which is especially effective 
for read intensive workloads

Longer-term we will be able to introduce 
* cross-datacenter support for DR,
* distributed transactions,
* distributed namespaces

h4. > and disadvantages (slowdown due to consensus) of this solution.

I don't know of any disadvantages. But your question is probably about the 
performance.
I am not in a position to publish exact numbers. When we have 
ZKCoordinationEnginge ready we can run benchmarks. I can say the following:
# I don't see any degradation in running DFSIO or teragen (write workload) with 
single NN vs. three coordinated CNodes.
# There is no difference in running NNThroughputBenchmark on a single CNode 
(without coordination) vs a standard NameNode. No coordination, because it is 
the nature of NNThroughputBenchmark that it bypasses RPC, which the 
coordination is tied to and therefore is also bypassed.
# New microbenchmark is needed to capture potential overhead of coordination. 
Conceptually I do not see degradation in throughput of CNode write operations. 
We should expect degradation in latency of individual operations, but there are 
different known ways to optimize it.
Saying this, I do not rule out that initially the performance of CNode may be 
bad, but optimizations will be developed similar to how we optimized current 
NameNode.

h4. > what happens when a change cannot be journaled after consensus is reached 
and memory is updated.

The short answer is that it is the same as standard NN.
Journals are local for each CNode. If a journal stream fails it is excluded 
from further journaling. When CNode runs out of journal streams it fails and is 
brought down, not different from any other failure. The rest of CNodes continue 
normal operations as long as they form a quorum.
An interesting aspect here that edits are optional for CNodes and can be 
completely eliminated for performance reasons. Recent transactions are anyways 
stored in CE. And as long as at least one CNode is alive it is the source of 
the current namespace state.

h4. > How are requests from other clients handled when a request is being 
processed?

Clients can submit proposals simultaneously, then they wait until the agreement 
is executed. Same as RPC.

h4. > sighted as bottleneck currently, the namenode single lock, much worser?

It is a different discussion whether single lock is a bottleneck of current NN. 
Last I investigated this it was not. The bottleneck was the journaling io. Also 
it was noticed that the real cluster demand was an order of magnitude lower 
than the NN capacity. 
CNodes does not make it worse. We used the same optimization of batching edits 
for syncing them together before replying to the client so that agreements 
could be executed without waiting for syncs.

h4. > abandon BackupNode based solutions?

Not sure which _solutions_, but there will be no more need to keep BackupNode 
class from my perspective once CNode is in.

> Coordinated replication of the namespace using ConsensusNode
> ------------------------------------------------------------
>
>                 Key: HDFS-6469
>                 URL: https://issues.apache.org/jira/browse/HDFS-6469
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>         Attachments: CNodeDesign.pdf
>
>
> This is a proposal to introduce ConsensusNode - an evolution of the NameNode, 
> which enables replication of the namespace on multiple nodes of an HDFS 
> cluster by means of a Coordination Engine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to