[jira] [Comment Edited] (HADOOP-15129) Datanode caches namenode DNS lookup failure and cannot startup

2021-08-27 Thread Chris Nauroth (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406116#comment-17406116
 ] 

Chris Nauroth edited comment on HADOOP-15129 at 8/28/21, 4:10 AM:
--

This remains a problem for cloud infrastructure deployments, so I'd like to 
pick it up and see if we can get it completed.  I've sent a pull request with 
the following changes compared to the prior revision:

* Remove older code for a throw of {{UnknownHostException}}.  This lies outside 
the retry loop, so even though the earlier patch did the right thing by placing 
the throw inside the retry loop, this remaining code perpetuated the problem of 
a permanently unresolved host (see the sketch after this list).
* Make minor formatting changes in the test to resolve Checkstyle issues 
flagged in the last Yetus run.
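
To make the control-flow point concrete, here is a minimal, hypothetical sketch; 
the class and method names are illustrative, not the actual {{Client.java}} 
code.  The unresolved-host throw needs to live inside the retry scope, so that 
each attempt re-creates the address and re-queries DNS (the HADOOP-7472 
behavior):
{code:java}
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Illustration only -- not the patch itself.
public class RetryShapeSketch {
  static InetSocketAddress resolveWithRetry(String host, int port, int maxRetries)
      throws UnknownHostException, InterruptedException {
    for (int i = 0; i < maxRetries; i++) {
      // Re-create the address on every attempt so the DNS lookup is re-run.
      InetSocketAddress addr = new InetSocketAddress(host, port);
      if (!addr.isUnresolved()) {
        return addr;
      }
      Thread.sleep(1000L);
    }
    // Throw only after retries are exhausted; a throw placed outside the
    // loop turns a transient DNS failure into a permanent one.
    throw new UnknownHostException(host + ":" + port);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(resolveWithRetry("cluster-32f5-m", 8020, 10));
  }
}
{code}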

Additionally, I've confirmed testing of the patch in moderate-sized (200-node) 
Dataproc cluster deployments.

[~ste...@apache.org], [~arp], [~raviprak], [~ajayydv], and [~shahrs87], can we 
please work on getting this reviewed and committed?  I'm interested in merging 
this down to branch-3.3, branch-3.2, branch-2.10 and branch-2.9.  The patch 
as-is won't apply cleanly to 2.x.  If you approve, then I'll prepare separate 
pull requests for those branches.

Also, BTW, hello everyone.  :-)


was (Author: cnauroth):
This remains a problem for cloud infrastructure deployments, so I'd like to 
pick it up and see if we can get it completed.  I've sent a pull request with 
the following changes compared to the prior revision:

* Remove older code for a throw of {{UnknownHostException}}.  This lies outside 
the retry loop, so even though the earlier patch did the right thing by placing 
the throw inside the retry loop, this remaining code perpetuated the problem of 
a permanently unresolved host.
* Make minor formatting changes in the test to resolve Checkstyle issues 
flagged in the last Yetus run.

Additionally, I've confirmed testing of the patch in moderate-sized (200-node) 
Dataproc cluster deployments.

[~ste...@apache.org], [~arp], [~raviprak], [~ajayydv], and [~shahrs87], can we 
please work on getting this reviewed and committed?  I'm interested in merging 
this down to branch-3.3, branch-3.2, branch-2.10 and branch-2.10.  The patch 
as-is won't apply cleanly to 2.x.  If you approve, then I'll prepare separate 
pull requests for those branches.

Also, BTW, hello everyone.  :-)

> Datanode caches namenode DNS lookup failure and cannot startup
> --
>
> Key: HADOOP-15129
> URL: https://issues.apache.org/jira/browse/HADOOP-15129
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.8.2
> Environment: Google Compute Engine.
> I'm using Java 8, Debian 8, Hadoop 2.8.2.
>Reporter: Karthik Palaniappan
>Assignee: Chris Nauroth
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HADOOP-15129.001.patch, HADOOP-15129.002.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> On startup, the Datanode creates an InetSocketAddress to register with each 
> namenode. Though there are retries on connection failure throughout the 
> stack, the same InetSocketAddress is reused.
> InetSocketAddress is an interesting class, because it resolves DNS names to 
> IP addresses on construction, and it is never refreshed. Hadoop re-creates an 
> InetSocketAddress in some cases just in case the remote IP has changed for a 
> particular DNS name: https://issues.apache.org/jira/browse/HADOOP-7472.
> Anyway, on startup, you can see the Datanode log: "Namenode...remains 
> unresolved" -- referring to the fact that the DNS lookup failed.
> {code:java}
> 2017-11-02 16:01:55,115 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Refresh request received for nameservices: null
> 2017-11-02 16:01:55,153 WARN org.apache.hadoop.hdfs.DFSUtilClient: Namenode 
> for null remains unresolved for ID null. Check your hdfs-site.xml file to 
> ensure namenodes are configured properly.
> 2017-11-02 16:01:55,156 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Starting BPOfferServices for nameservices: 
> 2017-11-02 16:01:55,169 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Block pool  (Datanode Uuid unassigned) service to 
> cluster-32f5-m:8020 starting to offer service
> {code}
> The Datanode then proceeds to use this unresolved address, as it may work if 
> the DN is configured to use a proxy. Since I'm not using a proxy, it forever 
> prints out this message:
> {code:java}
> 2017-12-15 00:13:40,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:45,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:50,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:55,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:14:00,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> {code}
> Unfortunately, the log doesn't contain the exception that triggered it, but 
> the culprit is actually in IPC Client: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java#L444.
> This line was introduced in https://issues.apache.org/jira/browse/HADOOP-487 
> to give a clear error message when somebody misspells an address.
> However, the fix in HADOOP-7472 doesn't apply here, because that code happens 
> in Client#getConnection after the Connection is constructed.
> My proposed fix (will attach a patch) is to move this exception out of the 
> constructor and into a place that will trigger HADOOP-7472's logic to 
> re-resolve addresses. If the DNS failure was temporary, this will allow the 
> connection to succeed. If not, the connection will fail after ipc client 
> retries (default: 10 seconds' worth of retries).
> I want to fix this in ipc client rather than just in Datanode startup, as 
> this fixes temporary DNS issues for all of Hadoop.
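
For context, a minimal standalone sketch of the root behavior (separate from 
the patch; the hostname is an example).  {{InetSocketAddress}} performs the DNS 
lookup once, at construction, so re-resolution requires constructing a fresh 
instance:
{code:java}
import java.net.InetSocketAddress;

// Demonstrates the caching at the root of this issue: the DNS lookup runs
// once, in the constructor, and its result (including failure) is cached
// for the lifetime of the object.
public class ResolutionDemo {
  public static void main(String[] args) {
    InetSocketAddress addr = new InetSocketAddress("cluster-32f5-m", 8020);
    System.out.println("unresolved: " + addr.isUnresolved());

    // Retrying I/O against the same object never re-runs the lookup;
    // recovery means building a new instance, as HADOOP-7472 does:
    InetSocketAddress fresh =
        new InetSocketAddress(addr.getHostName(), addr.getPort());
    System.out.println("unresolved after re-create: " + fresh.isUnresolved());
  }
}
{code}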

[jira] [Comment Edited] (HADOOP-15129) Datanode caches namenode DNS lookup failure and cannot startup

2018-12-12 Thread Ravi Prakash (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718798#comment-16718798
 ] 

Ravi Prakash edited comment on HADOOP-15129 at 12/12/18 11:10 AM:
--

Hi Karthik! Thanks for your contribution. Could you please rebase the patch to 
the latest trunk? I usually apply patches using
{code:java}
$ git apply {code}
A few suggestions:
 # Could you please use short descriptions in JIRA? [I was told a long time 
ago|https://issues.apache.org/jira/browse/HDFS-2011?focusedCommentId=13041707&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13041707]. :)
 # When using JIRA numbers, could you please write HDFS-8068 (instead of just 
8068) because issues often cut across several different projects, and this way 
JIRA creates nice links for viewers to click on?

Patches are usually committed to trunk *first* and then a (possibly) different 
version of the patch may be committed to earlier branches like branch-2. So 
technically you could have used neat Lambdas in the trunk patch. ;) It's a nit, 
though.

I'm trying to find the wiki page that tried to explain certain errors. I'm 
afraid I rarely found them useful (it's probably because we never really 
expanded those wiki pages), so I'm fine with a more helpful error in the logs.

 


was (Author: raviprak):
Hi Karthik! Thanks for your contribution. Could you please rebase the patch to 
the latest trunk? I usually apply patches using
{code:java}
$ git apply {code}
A few suggestions:
 # Could you please use short descriptions in JIRA? I was told a long time ago. 
:)
 # When using JIRA numbers, could you please write HDFS-8068 (instead of just 
8068) because issues often cut across several different projects, and this way 
JIRA creates nice links for viewers to click on?

Patches are usually committed to trunk *first* and then a (possibly) different 
version of the patch may be committed to earlier branches like branch-2. So 
technically you could have used neat Lambdas in the trunk patch. ;) It's a nit, 
though.

I'm trying to find the wiki page that tried to explain certain errors. I'm 
afraid I rarely found them useful (it's probably because we never really 
expanded those wiki pages), so I'm fine with a more helpful error in the logs.

 


[jira] [Comment Edited] (HADOOP-15129) Datanode caches namenode DNS lookup failure and cannot startup

2018-12-12 Thread Ravi Prakash (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718798#comment-16718798
 ] 

Ravi Prakash edited comment on HADOOP-15129 at 12/12/18 11:13 AM:
--

Hi Karthik! Thanks for your contribution. Could you please rebase the patch to 
the latest trunk? I usually apply patches using
{code:java}
$ git apply {code}
A few suggestions:
 # Could you please use short descriptions in JIRA? I was told a long time ago. 
:)
 # When using JIRA numbers, could you please write HDFS-8068 (instead of just 
8068) because issues often cut across several different projects, and this way 
JIRA creates nice links for viewers to click on?

Patches are usually committed to trunk *first* and then a (possibly) different 
version of the patch may be committed to earlier branches like branch-2. So 
technically you could have used neat Lambdas in the trunk patch. ;) It's a nit, 
though.

I'm trying to find the wiki page that tried to explain certain errors. I'm 
afraid I rarely found them useful (it's probably because we never really 
expanded those wiki pages), so I'm fine with a more helpful error in the logs.

Could you please also comment on whether you have been running with this patch 
in production for any length of time, and whether you have seen any issues 
with it?

I concur that this is extremely important code, so it behooves us to tread very 
carefully. 


was (Author: raviprak):
Hi Karthik! Thanks for your contribution. Could you please rebase the patch to 
the latest trunk? I usually apply patches using
{code:java}
$ git apply {code}
A few suggestions:
 # Could you please use short descriptions in JIRA? [I was told a long time 
ago|https://issues.apache.org/jira/browse/HDFS-2011?focusedCommentId=13041707&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13041707]. :)
 # When using JIRA numbers, could you please write HDFS-8068 (instead of just 
8068) because issues often cut across several different projects, and this way 
JIRA creates nice links for viewers to click on?

Patches are usually committed to trunk *first* and then a (possibly) different 
version of the patch may be committed to earlier branches like branch-2. So 
technically you could have used neat Lambdas in the trunk patch. ;) It's a nit, 
though.

I'm trying to find the wiki page that tried to explain certain errors. I'm 
afraid I rarely found them useful (it's probably because we never really 
expanded those wiki pages), so I'm fine with a more helpful error in the logs.

 


[jira] [Comment Edited] (HADOOP-15129) Datanode caches namenode DNS lookup failure and cannot startup

2018-02-12 Thread Ajay Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361506#comment-16361506
 ] 

Ajay Kumar edited comment on HADOOP-15129 at 2/12/18 9:54 PM:
--

{quote}The local IP address for the connection is indeterminate until the 
target hostname is resolved, e.g., in multi-homed setups. So I am okay with 
leaving this as null for now.{quote}
[~arpitagarwal], I see your point. I was thinking that a localhost of "UNKNOWN" 
may mislead someone into a wrong diagnosis of the problem. I'm wondering if we 
can improve the messaging, but that can be discussed separately and doesn't 
need to be part of this jira.


was (Author: ajayydv):
{quote}The local IP address for the connection is indeterminate until the 
target hostname is resolved, e.g., in multi-homed setups. So I am okay with 
leaving this as null for now.{quote}
I see your point. I was thinking that a localhost of "UNKNOWN" may mislead 
someone into a wrong diagnosis of the problem. I'm wondering if we can improve 
the messaging, but that can be discussed separately and doesn't need to be part 
of this jira.


[jira] [Comment Edited] (HADOOP-15129) Datanode caches namenode DNS lookup failure and cannot startup

2018-02-12 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361358#comment-16361358
 ] 

Arpit Agarwal edited comment on HADOOP-15129 at 2/12/18 7:53 PM:
-

The test case {{testIpcFlakyHostResolution}} also lgtm. The other new test 
{{testIpcHostResolutionTimeout}} looks unrelated to this change. Do we need it 
[~Karthik Palaniappan]?

bq. can we pass "localhost" for NetUtils.wrapException instead of null. Above 
message in logs is little misleading.
Good point. The local IP address for the connection is indeterminate until the 
target hostname is resolved, e.g., in multi-homed setups. So I am okay with 
leaving this as null for now. [~ajayydv], let me know if that sounds reasonable.
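
For reference, a small sketch of the call under discussion, using the 
five-argument {{NetUtils.wrapException}} overload; the host and port values 
are examples, not taken from the patch:
{code:java}
import java.io.IOException;
import java.net.UnknownHostException;
import org.apache.hadoop.net.NetUtils;

// Sketch only; values are illustrative. Passing null for the local host is
// the case discussed above: the wrapped message cannot name a local address
// because none is determined until the target hostname resolves.
public class WrapExceptionSketch {
  public static void main(String[] args) {
    IOException wrapped = NetUtils.wrapException(
        "cluster-32f5-m", 8020,                      // destination host/port
        null, 0,                                     // local host/port unknown
        new UnknownHostException("cluster-32f5-m"));
    System.out.println(wrapped.getMessage());
  }
}
{code}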


was (Author: arpitagarwal):
The test case {{testIpcFlakyHostResolution}} also lgtm. The other new test 
{{testIpcHostResolutionTimeout}} looks related to this change. Do we need it 
[~Karthik Palaniappan]?

bq. can we pass "localhost" for NetUtils.wrapException instead of null. Above 
message in logs is little misleading.
Good point. The local IP address for the connection is indeterminate until the 
target hostname is resolved, e.g., in multi-homed setups. So I am okay with 
leaving this as null for now. [~ajayydv], let me know if that sounds reasonable.


[jira] [Comment Edited] (HADOOP-15129) Datanode caches namenode DNS lookup failure and cannot startup

2017-12-19 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296888#comment-16296888
 ] 

Rushabh S Shah edited comment on HADOOP-15129 at 12/19/17 2:40 PM:
---

Isn't this a dupe of HDFS-8068?
It is also related to HADOOP-12125.


was (Author: shahrs87):
Isn't this a dupe of HDFS-8068?




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org