Karthik Palaniappan created HADOOP-15129:
--------------------------------------------
Summary: Datanode caches namenode DNS lookup failure and cannot
start up
Key: HADOOP-15129
URL: https://issues.apache.org/jira/browse/HADOOP-15129
Project: Hadoop Common
Issue Type: Bug
Components: ipc
Affects Versions: 2.8.2
Environment: Google Compute Engine, or any environment where a small
percent of DNS lookups fail.
I'm using Java 8, Debian 8, Hadoop 2.8.2.
Reporter: Karthik Palaniappan
Priority: Minor
On startup, the Datanode creates an InetSocketAddress to register with each
namenode. Though there are retries on connection failure throughout the stack,
the same InetSocketAddress is reused.
InetSocketAddress is an interesting class, because it resolves DNS names to IP
addresses on construction and never refreshes the result. In some cases Hadoop
re-creates an InetSocketAddress in case the remote IP has changed for a
particular DNS name: https://issues.apache.org/jira/browse/HADOOP-7472.
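A minimal sketch of the resolve-once behavior, using only java.net. The class
name ResolveOnce and the helper reResolve are hypothetical and are not Hadoop
code. InetSocketAddress.createUnresolved stands in for a failed DNS lookup, and
an IP literal is used as the host name so that the example does not depend on a
working resolver.

```java
import java.net.InetSocketAddress;

class ResolveOnce {
    // Hypothetical helper: build a fresh InetSocketAddress from the same host
    // and port, which triggers a new resolution attempt.
    static InetSocketAddress reResolve(InetSocketAddress addr) {
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }

    public static void main(String[] args) {
        // createUnresolved skips the lookup entirely, simulating an address
        // whose DNS query failed at construction time.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("127.0.0.1", 8020);
        System.out.println(stale.isUnresolved()); // true

        // The stale object never resolves itself; only a new instance does.
        InetSocketAddress fresh = reResolve(stale);
        System.out.println(stale.isUnresolved()); // still true
        System.out.println(fresh.isUnresolved()); // false
    }
}
```

The point is that retrying with the same object cannot help; a caller has to
construct a new address to get a new lookup.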
Anyway, on startup, you can see the Datanode log: "Namenode...remains
unresolved" -- referring to the fact that DNS lookup failed.
{code:java}
2017-11-02 16:01:55,115 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Refresh request received for nameservices: null
2017-11-02 16:01:55,153 WARN org.apache.hadoop.hdfs.DFSUtilClient: Namenode for
null remains unresolved for ID null. Check your hdfs-site.xml file to ensure
namenodes are configured properly.
2017-11-02 16:01:55,156 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Starting BPOfferServices for nameservices: <default>
2017-11-02 16:01:55,169 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Block pool <registering> (Datanode Uuid unassigned) service to
cluster-32f5-m:8020 starting to offer service
{code}
The Datanode then proceeds to use this unresolved address, as it may work if
the DN is configured to use a proxy. Since I'm not using a proxy, it forever
prints out this message:
{code:java}
2017-12-15 00:13:40,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:13:45,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:13:50,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:13:55,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Problem connecting to server: cluster-32f5-m:8020
2017-12-15 00:14:00,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Problem connecting to server: cluster-32f5-m:8020
{code}
Unfortunately, the log doesn't contain the exception that triggered it, but the
culprit is actually in IPC Client:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java#L444.
This line was introduced in https://issues.apache.org/jira/browse/HADOOP-487 to
give a clear error message when somebody misspells an address.
However, the fix in HADOOP-7472 doesn't apply here, because that code runs in
Client#getConnection after the Connection is constructed, so the constructor
throws before the re-resolution logic is ever reached.
My proposed fix (will attach a patch) is to move this exception out of the
constructor and into a place that will trigger HADOOP-7472's logic to
re-resolve addresses. If the DNS failure was temporary, this will allow the
connection to succeed. If not, the connection will fail after the IPC client
exhausts its retries (10 seconds' worth of retries by default).
I want to fix this in ipc client rather than just in Datanode startup, as this
fixes temporary DNS issues for all of Hadoop.
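The idea behind the proposed fix can be sketched roughly as follows. This is
not the attached patch and not the actual ipc Client code; the class
RetryWithReResolve, the method resolveWithRetries, and the Supplier-based
lookup are all hypothetical names chosen for illustration. The simulated lookup
fails twice and then succeeds, standing in for a temporary DNS failure.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class RetryWithReResolve {
    // Hypothetical helper: perform at most maxAttempts lookups, building a new
    // InetSocketAddress each time instead of reusing an unresolved one.
    static InetSocketAddress resolveWithRetries(Supplier<InetSocketAddress> lookup,
                                                int maxAttempts) {
        InetSocketAddress addr = lookup.get();
        for (int attempt = 1; addr.isUnresolved() && attempt < maxAttempts; attempt++) {
            addr = lookup.get();
        }
        return addr; // may still be unresolved if every attempt failed
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated flaky DNS: the first two lookups fail, the third succeeds.
        // An IP literal is used for the success case to avoid a real DNS query.
        Supplier<InetSocketAddress> flaky = () -> calls.incrementAndGet() < 3
                ? InetSocketAddress.createUnresolved("cluster-32f5-m", 8020)
                : new InetSocketAddress("127.0.0.1", 8020);

        InetSocketAddress addr = resolveWithRetries(flaky, 5);
        System.out.println(addr.isUnresolved() + " after " + calls.get() + " lookups");
        // prints "false after 3 lookups"
    }
}
```

If the failure is permanent, the loop simply returns an unresolved address
after the last attempt, and the caller can raise the same clear error as today.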