[ https://issues.apache.org/jira/browse/HADOOP-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-9150:
--------------------------------

    Attachment: tracing-resolver.tgz
                log.txt

To diagnose this, I wrote a wrapper implementation of the NameService SPI which 
logs all resolutions. Attached is the source for the tracing implementation 
along with a log I captured on a test cluster. Here you can see a DNS lookup 
coming from the path canonicalization code:

{code}
java.lang.Exception: looking up ha-nn-uri
        at MyNameservice.lookupAllHostAddr(MyNameservice.java:11)
...
        at org.apache.hadoop.security.SecurityUtil$StandardHostResolver.getByName(SecurityUtil.java:538)
        at org.apache.hadoop.security.SecurityUtil.getByName(SecurityUtil.java:526)
        at org.apache.hadoop.net.NetUtils.canonicalizeHost(NetUtils.java:283)
        at org.apache.hadoop.net.NetUtils.getCanonicalUri(NetUtils.java:255)
        at org.apache.hadoop.fs.FileSystem.getCanonicalUri(FileSystem.java:214)
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:524)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:170)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:401)
...
{code}
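For anyone who wants the same kind of trace without unpacking the attachment, the wrapping idea can be sketched in plain Java. This is not the attached MyNameservice code: the JDK NameService SPI is an internal sun.* interface, so the sketch assumes a hypothetical HostResolver interface and a hypothetical TracingResolver that records an Exception (message "looking up <host>", matching the log above) before delegating to the real resolver.

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

public class TracingResolverDemo {

    /** Hypothetical resolver abstraction; not a Hadoop or JDK API. */
    interface HostResolver {
        InetAddress getByName(String host) throws UnknownHostException;
    }

    /** Hypothetical wrapper: records a stack trace for every lookup, then delegates. */
    static final class TracingResolver implements HostResolver {
        private final HostResolver delegate;
        private final List<Exception> traces = new ArrayList<>();

        TracingResolver(HostResolver delegate) {
            this.delegate = delegate;
        }

        @Override
        public InetAddress getByName(String host) throws UnknownHostException {
            // Capture the caller's stack so the code path that triggered DNS is visible.
            Exception trace = new Exception("looking up " + host);
            traces.add(trace);
            trace.printStackTrace();
            return delegate.getByName(host);
        }

        List<Exception> traces() {
            return traces;
        }
    }

    public static void main(String[] args) {
        // Stub delegate that never touches the network: every name is "not found".
        HostResolver notFound = host -> { throw new UnknownHostException(host); };
        TracingResolver tracer = new TracingResolver(notFound);
        try {
            tracer.getByName("ha-nn-uri");
        } catch (UnknownHostException e) {
            System.out.println("not found: " + e.getMessage());
        }
        System.out.println("lookups traced: " + tracer.traces().size());
    }
}
{code}

Recording an Exception rather than just the host name is what makes the caller (here NetUtils.canonicalizeHost) show up in the log.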
                
> Unnecessary DNS resolution attempts for logical URIs
> ----------------------------------------------------
>
>                 Key: HADOOP-9150
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9150
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0, 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Priority: Critical
>         Attachments: log.txt, tracing-resolver.tgz
>
>
> In the FileSystem code, we accidentally try to DNS-resolve the logical name 
> before it is converted to an actual host name. In some DNS setups, this can 
> cause a big slowdown: e.g., in one misconfigured cluster we saw a 2-3x drop in 
> terasort throughput, since every task wasted a lot of time waiting for slow 
> "not found" responses from DNS.
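The description implies that canonicalization should check whether a URI's host is a logical name before doing any DNS work. A minimal sketch of that check follows; it is not the actual Hadoop patch. The Canonicalizer hook, the LogicalUriCanonicalizer class, and the caller-supplied set of logical names are all hypothetical (in a real cluster those names would come from the HA configuration), and falling back to the original host on UnknownHostException is an assumption of the sketch.

{code:java}
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.Set;

public class LogicalUriCanonicalizer {

    /** Hypothetical hook that performs the DNS canonicalization (host -> FQDN). */
    interface Canonicalizer {
        String canonicalHost(String host) throws UnknownHostException;
    }

    private final Set<String> logicalNames;      // hypothetical: logical HA names from config
    private final Canonicalizer canonicalizer;
    private int dnsLookups = 0;

    LogicalUriCanonicalizer(Set<String> logicalNames, Canonicalizer canonicalizer) {
        this.logicalNames = logicalNames;
        this.canonicalizer = canonicalizer;
    }

    URI canonicalize(URI uri) throws URISyntaxException {
        String host = uri.getHost();
        if (host == null || logicalNames.contains(host)) {
            // Logical names have no DNS record; resolving them only waits on "not found".
            return uri;
        }
        dnsLookups++;
        String canonical;
        try {
            canonical = canonicalizer.canonicalHost(host);
        } catch (UnknownHostException e) {
            canonical = host; // assumption: keep the name as given if DNS fails
        }
        return new URI(uri.getScheme(), uri.getUserInfo(), canonical, uri.getPort(),
                uri.getPath(), uri.getQuery(), uri.getFragment());
    }

    int dnsLookups() {
        return dnsLookups;
    }

    public static void main(String[] args) throws URISyntaxException {
        // Stub canonicalizer: appends a domain instead of querying real DNS.
        LogicalUriCanonicalizer c = new LogicalUriCanonicalizer(
                Collections.singleton("ha-nn-uri"), h -> h + ".example.com");
        System.out.println(c.canonicalize(new URI("hdfs://ha-nn-uri/user/todd")));
        System.out.println(c.canonicalize(new URI("hdfs://nn1:8020/user/todd")));
        System.out.println("DNS lookups: " + c.dnsLookups());
    }
}
{code}

Only the physical host (nn1) reaches the canonicalizer; the logical URI is returned untouched, so no task ever waits on a "not found" answer for it.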

--
This message is automatically generated by JIRA.
