[ https://issues.apache.org/jira/browse/HDFS-599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868745#action_12868745 ]
Sanjay Radia commented on HDFS-599:
-----------------------------------

Having all protocols serve on all ports is strange and not standard practice. However, I do agree with the use cases: during startup, or during a period of high load on the NN, an admin may want to issue a standard NN operation and ensure that it gets served promptly, perhaps with priority.

I agree with Dhruba that breaking the client protocol into two parts is questionable and, IMHO, architecturally not clean (imagine explaining to someone why we split the client protocol into two parts).

There are two solutions here. One is to give priority to certain users (this is very complex and I don't recommend doing it). The other is to extend Hadoop's existing Service ACL: the Service ACL specifies the protocols and the list of users and groups that are allowed to access each protocol. I suggest a separate jira to extend the Service ACL to optionally specify a port in addition to the protocol name. Dmytro, I request that you also complete this other jira independently, in the spirit of providing a clean, comprehensive solution to the problem of multiple protocols on multiple ports.

> Improve Namenode robustness by prioritizing datanode heartbeats over client requests
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-599
>                 URL: https://issues.apache.org/jira/browse/HDFS-599
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Dmytro Molkov
>         Attachments: HDFS-599.patch
>
>
> The namenode processes RPC requests from clients that are reading/writing to files as well as heartbeats/block reports from datanodes.
> Sometimes, for various reasons (Java GC runs, inconsistent performance of the NFS filer that stores the HDFS transaction logs, etc.), the namenode encounters transient slowness.
> For example, if the device that stores the HDFS transaction logs becomes sluggish, the Namenode's ability to process RPCs slows down to a certain extent. During this time, the RPCs from clients as well as the RPCs from datanodes suffer in similar fashion. If the underlying problem becomes worse, the NN's ability to process a heartbeat from a DN is severely impacted, causing the NN to declare that the DN is dead. Then the NN starts replicating blocks that used to reside on the now-declared-dead datanode. This adds extra load to the NN. Then the now-declared-dead datanode finally re-establishes contact with the NN and sends a block report. Block report processing on the NN is another heavyweight activity, thus causing more load on the already overloaded namenode.
> My proposal is that the NN should try its best to continue processing RPCs from datanodes and give lesser priority to serving client requests. The datanode RPCs are integral to the consistency and performance of the Hadoop file system, and it is better to protect them at all costs. This will ensure that the NN recovers from the hiccup much faster than it does now.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
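The heartbeat-over-client prioritization proposed in the issue description can be sketched with a priority-ordered call queue. This is a rough illustration, not the actual HDFS-599 patch: the class names `PriorityCallQueue`, `RpcCall`, and `Priority` are invented for this example, which only shows the ordering idea (datanode calls drain ahead of client calls).

```java
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch (not the HDFS-599 patch): an RPC call queue that
// serves datanode calls before client calls, illustrating the proposed
// prioritization. All names here are invented for this example.
public class PriorityCallQueue {
    // Lower ordinal = served first, so datanode RPCs outrank client RPCs.
    enum Priority { DATANODE, CLIENT }

    static class RpcCall implements Comparable<RpcCall> {
        final Priority priority;
        final String method;

        RpcCall(Priority priority, String method) {
            this.priority = priority;
            this.method = method;
        }

        @Override
        public int compareTo(RpcCall other) {
            return priority.compareTo(other.priority);
        }
    }

    private final PriorityBlockingQueue<RpcCall> queue =
            new PriorityBlockingQueue<>();

    void offer(RpcCall call) { queue.offer(call); }

    RpcCall take() throws InterruptedException { return queue.take(); }

    public static void main(String[] args) throws InterruptedException {
        PriorityCallQueue q = new PriorityCallQueue();
        q.offer(new RpcCall(Priority.CLIENT, "getBlockLocations"));
        q.offer(new RpcCall(Priority.DATANODE, "sendHeartbeat"));
        q.offer(new RpcCall(Priority.CLIENT, "create"));
        // The datanode heartbeat is dequeued first despite arriving second.
        System.out.println(q.take().method); // prints sendHeartbeat
    }
}
```

Under sustained client load this simple scheme can starve client calls entirely, which is presumably acceptable here since the stated goal is to protect datanode RPCs "at all costs" during a transient hiccup.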
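Sanjay's suggestion above (a Service ACL that optionally names a port as well as a protocol) might look something like the following hadoop-policy.xml fragment. The `security.datanode.protocol.acl` property is a real Service Level Authorization key; the `.port` companion property below is purely a hypothetical illustration of the proposed extension, not an existing Hadoop setting.

```xml
<!-- Hypothetical sketch of Sanjay's proposed extension. The first property
     is Hadoop's existing Service ACL for DatanodeProtocol; the second,
     with the ".port" suffix, is invented here to illustrate binding the
     ACL to a dedicated port and does not exist in Hadoop. -->
<property>
  <name>security.datanode.protocol.acl</name>
  <value>datanode_user datanode_group</value>
</property>
<property>
  <!-- Invented key: restrict DatanodeProtocol to a dedicated port. -->
  <name>security.datanode.protocol.acl.port</name>
  <value>8021</value>
</property>
```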