[ https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477131#comment-13477131 ]

Eli Collins commented on HDFS-3990:
-----------------------------------

bq. They will change when a pre-existing node, say one with the same storage id, is updated with the new info.

I'm not sure re-registering with a new IP and the same storage ID actually works today.

bq. The patch appears to change the way the include and exclude work by trusting who the datanode claims to be. What if a datanode "lies" about who it is? Or if a dns hiccup occurs when the datanode is going to register? It sends its name as an ip, but the exclude list only has hosts. There are a number of scenarios where a datanode could bypass the include/exclude list, which is why we should never trust the client.

Take another look at the patch: the NN is doing the lookup, not the DN, and only at registration time. How about we reject the DN registration in case of a DNS hiccup (rather than using the DN-supplied value, which the patch currently does in this case)? The DN will retry until it succeeds.

When working on HDFS-3171 I considered removing the ability for the DN to override the hostname, and having just one lookup per DN (i.e. currently both the NN and DN resolve the DN hostname). We could open a separate jira for that; it might be easier to layer this one atop it.

I'm against having DatanodeID fields that duplicate the other fields, since I think we can solve the problem here and avoid doing so. My experience from HDFS-3144 indicates we will introduce bugs and it's hard to correctly untangle later.

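To make the reject-on-lookup-failure suggestion concrete, here is a minimal sketch, not the code in the attached patch. Every name in it (RegistrationSketch, Resolver, Result, register, cachedHost) is hypothetical and does not come from the HDFS code base. It assumes the NN resolves the peer IP once at registration, checks the resolved host against the include/exclude lists, caches it, and rejects the registration if the lookup fails so that the DN retries. Treating an empty include list as "allow all" is also an assumption of the sketch.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; none of these names exist in HDFS. */
class RegistrationSketch {

  /** Reverse-lookup abstraction so the sketch can be exercised without real DNS. */
  interface Resolver {
    String hostFor(String ip) throws UnknownHostException;
  }

  /** JDK-backed resolver; a lookup that only yields the IP back counts as a failure. */
  static final Resolver DNS = ip -> {
    InetAddress addr = InetAddress.getByName(ip);
    String host = addr.getCanonicalHostName();
    if (host.equals(addr.getHostAddress())) {
      throw new UnknownHostException(ip);
    }
    return host;
  };

  enum Result { ACCEPTED, REJECTED_LOOKUP_FAILED, REJECTED_EXCLUDED, REJECTED_NOT_INCLUDED }

  private final Resolver resolver;
  private final Set<String> include; // assumption: empty means "allow all"
  private final Set<String> exclude;
  // Host resolved at registration, keyed by storage ID, so later readers
  // (for example a status page) never need to do a DNS lookup themselves.
  private final Map<String, String> hostByStorageId = new ConcurrentHashMap<>();

  RegistrationSketch(Resolver resolver, Set<String> include, Set<String> exclude) {
    this.resolver = resolver;
    this.include = include;
    this.exclude = exclude;
  }

  /** The server resolves the peer IP itself; the name the client claims is never used. */
  Result register(String storageId, String peerIp) {
    String host;
    try {
      host = resolver.hostFor(peerIp);
    } catch (UnknownHostException e) {
      // DNS hiccup: reject rather than fall back to a client-supplied name.
      // The caller is expected to retry registration later.
      return Result.REJECTED_LOOKUP_FAILED;
    }
    if (exclude.contains(host)) {
      return Result.REJECTED_EXCLUDED;
    }
    if (!include.isEmpty() && !include.contains(host)) {
      return Result.REJECTED_NOT_INCLUDED;
    }
    hostByStorageId.put(storageId, host);
    return Result.ACCEPTED;
  }

  /** Returns the host cached at registration time, if any. */
  Optional<String> cachedHost(String storageId) {
    return Optional.ofNullable(hostByStorageId.get(storageId));
  }
}
```

With this shape a failed lookup leaves nothing cached, so a retry after DNS recovers behaves like a first registration, and anything that only needs to display the hostname can read the cached value.
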
> NN's health report has severe performance problems
> --------------------------------------------------
>
>                 Key: HDFS-3990
>                 URL: https://issues.apache.org/jira/browse/HDFS-3990
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt
>
>
> The dfshealth page will place a read lock on the namespace while it does a
> dns lookup for every DN. On a multi-thousand node cluster, this often
> results in 10s+ load time for the health page. 10 concurrent requests were
> found to cause 7m+ load times during which time write operations blocked.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira