[ https://issues.apache.org/jira/browse/HADOOP-16543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922460#comment-16922460 ]

Steve Loughran commented on HADOOP-16543:
-----------------------------------------

# Java does a lot of this caching internally. It's been known to cache negative 
DNS entries in the past too, which was always a nightmare. The JVM knobs behind 
this are sketched below.
# A lot of this caching is going to be in layers (httpclient, gRPC) beneath the 
Hadoop code.
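
As a reference point, a minimal sketch of the JVM-level knobs behind that 
caching; the TTL values here are illustrative, not recommendations:

{code:java}
import java.security.Security;

public class DnsCacheSettings {
    public static void main(String[] args) {
        // Seconds to cache successful lookups. The default is implementation
        // specific, and with a security manager installed the JVM caches
        // forever. Must be set before InetAddress does its first lookup, or
        // edited into the JVM's java.security file for the whole process.
        Security.setProperty("networkaddress.cache.ttl", "30");

        // Seconds to cache *failed* lookups: the negative caching mentioned above.
        Security.setProperty("networkaddress.cache.negative.ttl", "5");
    }
}
{code}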

For the specific case of Hadoop's own services:
* They should think about using a registry service (hadoop registry, etcd, ...) 
to look things up again on failure rather than just spinning, though changing 
hostnames complicates Kerberos in ways I fear. There's a rough sketch of the 
look-it-up-again-on-failure idea below.
* There are probably lots of places we haven't discovered yet which need fixing.
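
A minimal sketch of that "look it up again on failure rather than just spin" 
idea, using plain DNS re-resolution as a stand-in for a proper registry lookup; 
the hostname, port and retry policy are made up:

{code:java}
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReResolvingClient {
    // Hypothetical endpoint; in real code this would come from configuration
    // or a registry service (hadoop registry, etcd, ...).
    static final String HOST = "resourcemanager.example.internal";
    static final int PORT = 8032;

    static Socket connect() throws InterruptedException {
        while (true) {
            try {
                // Re-resolve on every attempt instead of reusing an address that
                // was resolved once and may now point at a dead IP. Note this
                // still goes through the JVM's own cache, so the TTL settings
                // above matter here too.
                InetAddress addr = InetAddress.getByName(HOST);
                Socket s = new Socket();
                s.connect(new InetSocketAddress(addr, PORT), 5_000);
                return s;
            } catch (IOException e) {
                Thread.sleep(2_000); // back off, then retry with a fresh lookup
            }
        }
    }
}
{code}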

I propose:
* You explore changing the Java DNS TTL to see what difference that makes (a 
rough sketch of such an experiment is below).
* After doing that, if there are places deep in the codebase where we're caching 
DNS entries, we can worry about fixing those.
* If they are in dependent libraries, it'll have to span projects.
* If it's a matter of documentation, a new document could be started covering 
the challenge of deploying Hadoop applications in this world.
* Target the trunk branch for fixes; backporting can follow.
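
For the first bullet, a rough sketch of the kind of probe meant by "see what 
difference that makes"; the hostname is a placeholder and the intervals are 
arbitrary:

{code:java}
import java.net.InetAddress;
import java.security.Security;

public class DnsTtlProbe {
    public static void main(String[] args) throws Exception {
        // Lower the positive-cache TTL before the first lookup happens.
        Security.setProperty("networkaddress.cache.ttl", "10");

        String host = args.length > 0 ? args[0] : "example.org"; // placeholder
        InetAddress first = InetAddress.getByName(host);
        System.out.println("first lookup:  " + first.getHostAddress());

        // Flip the DNS record (or kill/recreate the pod) while this sleeps,
        // then see whether the second lookup picks up the new address.
        Thread.sleep(30_000);
        InetAddress second = InetAddress.getByName(host);
        System.out.println("second lookup: " + second.getHostAddress());
    }
}
{code}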

I'm supportive of this effort; I'm just not committing to anything beyond what I 
can do to review your work. Be advised, I'm never happy going near the IPC code 
myself, so reviews from others will be needed there.

> Cached DNS name resolution error
> --------------------------------
>
>                 Key: HADOOP-16543
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16543
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Roger Liu
>            Priority: Major
>
> In Kubernetes, a node may go down and then come back later with a 
> different IP address. Yarn clients which are already running will be unable 
> to rediscover the node after it comes back up, because they cache the original 
> IP address. This is problematic for cases such as Spark HA on Kubernetes, as 
> the node containing the resource manager may go down and come back up, meaning 
> existing node managers must then also be restarted.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
