[jira] [Commented] (HDFS-3990) NN's health report has severe performance problems

Eli Collins (JIRA) Tue, 16 Oct 2012 09:27:05 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477131#comment-13477131
 ]


Eli Collins commented on HDFS-3990:
-----------------------------------

bq. They will change when a pre-existing node, say one with the same storage 
id, is updated with the new info.

I'm not sure re-registering with a new IP and the same storage ID actually 
works today.

bq. The patch appears to change the way the include and exclude work by 
trusting who the datanode claims to be. What if a datanode "lies" about who it 
is? Or if a dns hiccup occurs when the datanode is going to register? It sends 
its name as an ip, but the exclude list only has hosts. There are a number of 
scenarios where a datanode could bypass the include/exclude list, which is why 
we should never trust the client.

Take another look at the patch: the NN does the lookup, not the DN, and only 
at registration time. How about we reject the DN registration in case of a DNS 
hiccup (rather than use the DN-supplied value, which is what the patch 
currently does in this case)? The DN will retry until it succeeds. When 
working on HDFS-3171 I considered removing the ability for the DN to override 
the hostname, so that there is just one lookup per DN (i.e., currently both 
the NN and DN resolve the DN hostname). We could open a separate jira for 
that; it might be easier to layer this one atop it.
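
To make the proposal concrete, here is a rough sketch of "resolve on the NN at 
registration time, reject on lookup failure". It is not taken from the patch: 
RegistrationCheck, Resolver, and RegistrationRejectedException are hypothetical 
names, the resolver is a stand-in so the example runs without DNS, and treating 
an empty include list as "allow all" is an assumption of the sketch.

```java
import java.net.UnknownHostException;
import java.util.Set;

// Hypothetical sketch; none of these names come from the HDFS-3990 patch.
class RegistrationCheck {
    /** Stand-in for the NN-side DNS lookup, so the sketch runs without a network. */
    interface Resolver {
        String hostNameFor(String ip) throws UnknownHostException;
    }

    /** Signals that the DN should back off and retry registration. */
    static class RegistrationRejectedException extends RuntimeException {
        RegistrationRejectedException(String message) {
            super(message);
        }
    }

    private final Resolver resolver;
    private final Set<String> includes; // assumption: empty means "allow all hosts"
    private final Set<String> excludes;

    RegistrationCheck(Resolver resolver, Set<String> includes, Set<String> excludes) {
        this.resolver = resolver;
        this.includes = includes;
        this.excludes = excludes;
    }

    /** Returns the NN-resolved hostname, or rejects the registration. */
    String register(String dnIp) {
        String host;
        try {
            // The NN resolves the address itself; the DN-supplied name is never trusted.
            host = resolver.hostNameFor(dnIp);
        } catch (UnknownHostException e) {
            // DNS hiccup: reject instead of falling back to what the DN claims.
            throw new RegistrationRejectedException("cannot resolve " + dnIp);
        }
        if (excludes.contains(host)) {
            throw new RegistrationRejectedException(host + " is excluded");
        }
        if (!includes.isEmpty() && !includes.contains(host)) {
            throw new RegistrationRejectedException(host + " is not included");
        }
        return host; // cache this on the node record; no later lookups needed
    }
}
```

Because the resolved name is returned once and can be stored with the node, 
nothing downstream would need to repeat the lookup.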

I'm against having DatanodeID fields that duplicate other fields, since I 
think we can solve the problem here without doing so. My experience from 
HDFS-3144 indicates that we will introduce bugs and that it's hard to 
correctly untangle later.
                
> NN's health report has severe performance problems
> --------------------------------------------------
>
>                 Key: HDFS-3990
>                 URL: https://issues.apache.org/jira/browse/HDFS-3990
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-3990.patch, HDFS-3990.patch, hdfs-3990.txt
>
>
> The dfshealth page will place a read lock on the namespace while it does a 
> dns lookup for every DN.  On a multi-thousand node cluster, this often 
> results in 10s+ load time for the health page.  10 concurrent requests were 
> found to cause 7m+ load times during which time write operations blocked.
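
For illustration only, the locking problem described above can be sketched as 
follows. This is not the approach in the attached patch (which moves the lookup 
to registration time); it only shows why holding the read lock across slow 
lookups hurts, by copying the node list under the lock and doing the slow work 
after releasing it. HealthReport and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Hypothetical sketch; not code from the NameNode.
class HealthReport {
    final ReentrantReadWriteLock namespaceLock = new ReentrantReadWriteLock();
    private final List<String> datanodeIps = new ArrayList<>();

    void addDatanode(String ip) {
        namespaceLock.writeLock().lock();
        try {
            datanodeIps.add(ip);
        } finally {
            namespaceLock.writeLock().unlock();
        }
    }

    /** Builds one report row per DN; slowLookup stands in for a DNS query. */
    List<String> render(Function<String, String> slowLookup) {
        List<String> snapshot;
        namespaceLock.readLock().lock();
        try {
            // Hold the lock only long enough to copy the list.
            snapshot = new ArrayList<>(datanodeIps);
        } finally {
            namespaceLock.readLock().unlock();
        }
        // The slow per-node work runs with no lock held, so writers are not blocked.
        List<String> rows = new ArrayList<>();
        for (String ip : snapshot) {
            rows.add(ip + " " + slowLookup.apply(ip));
        }
        return rows;
    }
}
```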

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
