[ https://issues.apache.org/jira/browse/HDFS-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026135#comment-14026135 ]

Todd Lipcon commented on HDFS-6469:
-----------------------------------

I have a few concerns about this:

h3. Fine grained locking
As Suresh mentioned above, I'm concerned that the consensus engine must fully 
serialize all write operations into the namespace. We already do this today by 
means of the FSN lock, but Daryn and others over at Yahoo have ongoing work to 
make the locking more fine-grained. If the consensus engine fully serializes 
everything into a single request stream, won't that make fine-grained locking 
considerably harder to get working?

Based on what I've seen in many production clusters, the single lock and the 
RPC system _are_ the bottleneck for many workloads, especially when some users 
perform heavy operations like removal of a large directory tree or listing a 
large dir.
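
To make the serialization point concrete, here is a rough illustration (the class name is made up and this is not the actual FSNamesystem code) of how a single namesystem-wide read/write lock funnels every mutation through one critical section, which is exactly what the fine-grained locking work is trying to break up:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative stand-in for the single namesystem-wide lock, not real HDFS code.
// Every write path takes the same writeLock, so one heavy operation (e.g. deleting
// a huge directory tree) stalls every other namespace operation, reads included.
public class GlobalNamespaceLockSketch {
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

  public void deleteLargeTree(String path) {
    fsLock.writeLock().lock();      // serializes ALL namespace mutations
    try {
      // walk and remove a potentially huge subtree while holding the lock
    } finally {
      fsLock.writeLock().unlock();
    }
  }

  public void listDirectory(String path) {
    fsLock.readLock().lock();       // readers also queue behind a long-running writer
    try {
      // enumerate the children of 'path'
    } finally {
      fsLock.readLock().unlock();
    }
  }
}
{code}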

h3. Coordinated reads
{quote}
We assume that only a few special files, like job.xml, should be exposed to coordinated reads. Therefore, an administrator can configure a set of patterns, which is recognized by a CNode, and when it sees a file name matching the configured pattern it initiates a coordinated read for that file.
{quote}

This behavior makes me fairly nervous. It assumes that administrators know, up 
front, the full set of applications that will run on the cluster and which of 
them demand consistent reads. I don't think this is always a fair assumption, 
and when the patterns are not configured correctly the resulting behavior will 
be very difficult to debug or diagnose.
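
To illustrate why a misconfiguration would be so hard to notice, here is a hypothetical sketch of the pattern-matching decision (the class name and configuration wiring are made up, not taken from the design doc). Any file the administrator didn't anticipate silently falls through to an ordinary, possibly stale read:

{code:java}
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration of the proposed behavior, not code from the design.
// The path-to-pattern match silently decides whether a read is coordinated, so a
// file nobody anticipated simply gets an uncoordinated (possibly stale) read.
public class CoordinatedReadSelector {
  private final List<Pattern> coordinatedPatterns;

  // 'regexes' would come from some admin-configured property.
  public CoordinatedReadSelector(List<String> regexes) {
    this.coordinatedPatterns =
        regexes.stream().map(Pattern::compile).collect(Collectors.toList());
  }

  public boolean requiresCoordinatedRead(String path) {
    for (Pattern p : coordinatedPatterns) {
      if (p.matcher(path).matches()) {
        return true;                // e.g. ".*/job\\.xml" triggers a coordinated read
      }
    }
    return false;                   // everything else reads uncoordinated, silently
  }
}
{code}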

h3. Double journaling
Another issue with this design is that it doubles the amount of journaling 
required. The consensus engine, to properly implement something like Paxos, 
needs to keep one journal, while the NN keeps another. These might be put on 
separate disks to minimize seeks, but that does imply that latency will be 
doubled as well. I'm surprised that you haven't seen this in your benchmarks, 
unless you're running on a very fast device like PCIe flash.
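
To put rough numbers on the latency concern (these figures are made up for illustration, not taken from any benchmark): if the consensus engine's journal sync and the NN edit log sync each cost a few milliseconds and happen sequentially on the write path rather than overlapping, the client-visible commit latency is roughly their sum:

{code:java}
// Back-of-the-envelope illustration only; the millisecond figures are invented.
// If the two journal syncs are not overlapped, commit latency roughly doubles.
public class DoubleJournalLatencySketch {
  public static void main(String[] args) {
    double consensusJournalSyncMs = 5.0;  // hypothetical sync cost in the consensus engine
    double nnEditLogSyncMs = 5.0;         // hypothetical sync cost in the NN edit log

    double singleJournalPath = nnEditLogSyncMs;
    double doubleJournalPath = consensusJournalSyncMs + nnEditLogSyncMs;

    System.out.printf("single journal: ~%.1f ms, double journal: ~%.1f ms (%.1fx)%n",
        singleJournalPath, doubleJournalPath, doubleJournalPath / singleJournalPath);
  }
}
{code}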

h3. Non-determinism
Section 5 of the design doc talks about the many places in which the NameNode 
is not currently fully deterministic. As you've described, there is a fair 
amount of work necessary to fix these things (e.g. ensuring that block placement 
decisions agree even if the heartbeats from datanodes arrive at slightly 
different times). Maintaining this determinism will become even harder once we 
make the locking more fine-grained -- even something as simple as sequential 
block ID generation is no longer trivial to make deterministic across nodes, 
since two operations on distinct parts of the namespace may grab the next ID in 
an unspecified order.
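
To make the block ID example concrete, here is an illustrative sketch (not actual HDFS code). Under one global lock the lock order fixes which caller gets which ID, but with per-directory locking two unrelated creates can reach the counter in either order, so replicas applying the "same" operations can assign different IDs unless the ordering itself is coordinated:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sequential block ID counter, not the real HDFS implementation.
// With fine-grained locking, creates in unrelated directories hit this counter
// in whatever order the threads happen to run, so two replicas replaying the
// same logical operations may hand out different block IDs.
public class BlockIdGeneratorSketch {
  private final AtomicLong lastBlockId = new AtomicLong(1000);

  public long nextBlockId() {
    return lastBlockId.incrementAndGet();   // caller order decides who gets which ID
  }

  public static void main(String[] args) throws InterruptedException {
    BlockIdGeneratorSketch gen = new BlockIdGeneratorSketch();
    Thread t1 = new Thread(() -> System.out.println("/a/file -> block " + gen.nextBlockId()));
    Thread t2 = new Thread(() -> System.out.println("/b/file -> block " + gen.nextBlockId()));
    t1.start();
    t2.start();                             // which file gets which ID is unspecified
    t1.join();
    t2.join();
  }
}
{code}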

I'm wary that this determinism requirement will be a significant maintenance 
burden on HDFS development in the future.

h3. Questions about the advantages
- Could you explain further why this design makes it any easier to implement a 
distributed namespace? It seems fully orthogonal to me, and in fact the 
determinism issues above may make it harder to scale the NameNode 
implementation up or out.

----
h3. Comparison vs an alternate design?
 
Overall, I'm wondering what advantages this approach might have over an 
alternate approach that builds on the design we've already got. For example, 
consider the following:
- enable the configuration flag that allows reads from the standby. This opens 
the same can of worms as the ConsensusNode, in that we need some way to ensure 
consistent reads when reading from standbys vs the leader; but the same 
solutions you're proposing here should work for the JN-based approach as well.
- add a small amount of code to the JN to allow a reader to "tail" the 
committed edits stream (e.g. a new servlet which uses chunked encoding to 
provide edit log "subscribers"; a rough sketch follows this list).
- change the SBN EditLogTailer to use the above interface instead of only 
reading from rolled segments.
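
As a rough sketch of the "subscriber" servlet idea above (the servlet name, request parameter, and framing are hypothetical, not an existing JournalNode API), the key point is simply a long-lived chunked HTTP response that streams committed edits to the standby as they are committed, instead of waiting for segments to roll:

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical sketch only; none of these names exist in the JournalNode today.
// A long-lived response with no Content-Length is sent with chunked transfer
// encoding, so the standby can consume committed edits as a continuous feed.
public class TailCommittedEditsServlet extends HttpServlet {

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    long nextTxId = Long.parseLong(req.getParameter("startTxId"));
    resp.setContentType("application/octet-stream");
    OutputStream out = resp.getOutputStream();

    while (!Thread.currentThread().isInterrupted()) {
      byte[] batch = readCommittedEditsFrom(nextTxId);   // placeholder for JN storage access
      if (batch.length > 0) {
        out.write(batch);
        out.flush();                                     // push each batch to the subscriber
        nextTxId += countTransactions(batch);
      } else {
        try {
          Thread.sleep(50);                              // wait for more committed edits
        } catch (InterruptedException e) {
          break;
        }
      }
    }
  }

  // Placeholders standing in for real access to the JN's committed edit log.
  private byte[] readCommittedEditsFrom(long txId) { return new byte[0]; }
  private long countTransactions(byte[] batch) { return 0; }
}
{code}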

I think these changes would be less disruptive than adding a new NameNode 
subclass, and give users the same benefits as you're proposing here. 
Additionally, there would be no double-logging overhead, and we wouldn't have 
to worry about non-determinism in the implementation. Lastly, a fully usable 
solution would be available to the community at large, whereas the design 
you're proposing seems like it will only be usably implemented by a proprietary 
extension (I don't consider the ZK "reference implementation" likely to 
actually work in a usable fashion).


> Coordinated replication of the namespace using ConsensusNode
> ------------------------------------------------------------
>
>                 Key: HDFS-6469
>                 URL: https://issues.apache.org/jira/browse/HDFS-6469
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>         Attachments: CNodeDesign.pdf
>
>
> This is a proposal to introduce ConsensusNode - an evolution of the NameNode, 
> which enables replication of the namespace on multiple nodes of an HDFS 
> cluster by means of a Coordination Engine.


