[
https://issues.apache.org/jira/browse/HADOOP-16543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922460#comment-16922460
]
Steve Loughran commented on HADOOP-16543:
-----------------------------------------
# Java likes this caching and does it a lot internally. It's been known to cache
negative DNS entries in the past too, which was always a nightmare.
# A lot of this caching is going to be in layers (httpclient, gRPC) beneath the
Hadoop code.
For the specific case of Hadoop's own services,
* they should think about using a registry service (Hadoop registry, etcd, ...)
to find things on failure rather than just spin, though changing hostnames
complicates Kerberos in ways I fear.
* There are probably lots of places we haven't discovered which need fixing.
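As a rough illustration of "find things on failure rather than just spin", here is a minimal sketch that looks the hostname up again after a connection failure instead of reusing the address object resolved at startup. This is not the registry approach and not Hadoop's IPC code; the class and method names (ReResolveSketch, reResolve, addressChanged) and port 8032 are hypothetical, chosen only for the example. The new lookup still goes through the JVM's own DNS cache, so it only helps once that cache entry has expired.

```java
import java.net.InetSocketAddress;

public class ReResolveSketch {

    // Build a new socket address from the cached one's host string and port.
    // The InetSocketAddress(String, int) constructor attempts a new lookup,
    // subject to the JVM's DNS cache TTL.
    static InetSocketAddress reResolve(InetSocketAddress cached) {
        return new InetSocketAddress(cached.getHostString(), cached.getPort());
    }

    // True only if the new lookup succeeded and produced a different address.
    static boolean addressChanged(InetSocketAddress cached, InetSocketAddress fresh) {
        return !fresh.isUnresolved() && !fresh.equals(cached);
    }

    public static void main(String[] args) {
        // Hypothetical endpoint, resolved once at startup.
        InetSocketAddress cached = new InetSocketAddress("localhost", 8032);

        // After a connection failure, resolve again before retrying.
        InetSocketAddress fresh = reResolve(cached);
        if (addressChanged(cached, fresh)) {
            System.out.println("Address changed, reconnecting to " + fresh);
        } else {
            System.out.println("Address unchanged or unresolved: " + fresh);
        }
    }
}
```

A retry loop would call reResolve before each reconnect attempt and swap in the new address only when addressChanged returns true.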
I propose
* you explore changing the Java DNS TTL to see what difference that makes.
* after doing that, if there are places deep in the codebase where we're caching
DNS entries, we can worry about fixing that.
* if they are in dependent libraries, the fix will have to span projects.
* if it's a matter of documentation, a new document could be started covering
the challenge of deploying Hadoop applications in this world.
* target the trunk branch for fixes; backporting can follow
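For the first step, a minimal sketch of changing the Java DNS TTL from code. The security properties networkaddress.cache.ttl and networkaddress.cache.negative.ttl are the standard JVM settings; the class name DnsTtlSketch and the values 30 and 0 are only illustrative assumptions, not recommendations. The properties need to be set before the JVM performs its first lookup, since the cache policy is read once.

```java
import java.security.Security;

public class DnsTtlSketch {

    // Set the JVM-wide DNS cache lifetimes, in seconds.
    // positiveSeconds: how long successful lookups are cached.
    // negativeSeconds: how long failed lookups are cached (0 = do not cache).
    static void setDnsCacheTtl(int positiveSeconds, int negativeSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeSeconds));
    }

    public static void main(String[] args) {
        // Illustrative values only: cache hits for 30s, never cache failures.
        setDnsCacheTtl(30, 0);

        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("networkaddress.cache.negative.ttl = "
                + Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```

The same properties can also be set in the JRE's java.security file, which avoids touching application code while experimenting.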
I'm supportive of this effort, but I'm not committing to anything beyond
reviewing your work. Be advised, I'm never happy going near the IPC code
myself, so reviews from others will be needed there.
> Cached DNS name resolution error
> --------------------------------
>
> Key: HADOOP-16543
> URL: https://issues.apache.org/jira/browse/HADOOP-16543
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 3.1.2
> Reporter: Roger Liu
> Priority: Major
>
> In Kubernetes, a node may go down and then come back later with a
> different IP address. YARN clients which are already running will be unable
> to rediscover the node after it comes back up due to caching the original IP
> address. This is problematic for cases such as Spark HA on Kubernetes, as the
> node containing the resource manager may go down and come back up, meaning
> existing node managers must then also be restarted.