jlpedrosa commented on a change in pull request #24702: [SPARK-27989] [Kubernetes] [Core] Added retries on the connection to the driver for k8s
URL: https://github.com/apache/spark/pull/24702#discussion_r292395924
########## File path: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile ##########
@@ -51,6 +51,8 @@ ENV SPARK_HOME /opt/spark
 WORKDIR /opt/spark/work-dir
 RUN chmod g+w /opt/spark/work-dir
+#Disable negative dns reslolution https://docs.oracle.com/javase/8/docs/technotes/guides/net/properties.html
+RUN sed -i -e 's/networkaddress.cache.negative.ttl=10/networkaddress.cache.negative.ttl=0/g' /usr/lib/jvm/java-1.8-openjdk/jre/lib/security/java.security

Review comment:
Hi @srowen

Let me answer properly. AFAIK negative DNS lookups are NOT cached at the OS level, at least not on Linux. Positive results can be cached, and there are retries and timeouts, but that also depends on the distribution: not all of them have caching enabled, and there are different ways to set it up.

What I meant is that, because the JVM caches the negative result, retries won't work: they'll happen in a tight loop (the for loop we were adding) and every iteration will hit the cached failure. The only way around that is to start adding sleeps in the Scala code, which may or may not be enough depending on the resolution timeouts at the OS level.

The end result of caching negative resolutions in the Java layer is that end users see very strange behaviour, where the OS-level timeouts are only respected once. If we don't disable the caching, the first call to resolve will wait for the OS timeout (I think 5 seconds is the default), and subsequent calls won't even try, because the Java layer won't invoke the OS resolver again until the cached entry expires. See [here](https://github.com/bpupadhyaya/openjdk-8/blob/45af329463a45955ea2759b89cb0ebfe40570c3f/jdk/src/share/classes/java/net/InetAddress.java#L1251) and [here](https://github.com/bpupadhyaya/openjdk-8/blob/45af329463a45955ea2759b89cb0ebfe40570c3f/jdk/src/share/classes/java/net/InetAddress.java#L885).

So, if we don't disable the negative DNS cache, retries alone won't solve the issue in a reliable manner (only a complicated combination of retries, sleeps in Scala, and OS timers would).
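To make the interaction between the negative cache and a retry loop concrete, here is a minimal Java sketch. It is not code from this PR: `NegativeDnsCacheSketch`, `Resolver`, and `resolveWithRetries` are hypothetical names used only for illustration. It sets the same `networkaddress.cache.negative.ttl` property that the Dockerfile change edits in `java.security`, but programmatically through `Security.setProperty`, on the assumption that this runs before the JVM performs its first lookup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical illustration, not part of the PR.
final class NegativeDnsCacheSketch {

    /** Hypothetical abstraction over the lookup, so the loop can be exercised without real DNS. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /**
     * Same setting the Dockerfile change applies to java.security.
     * Assumption: this is called before the first InetAddress lookup in the JVM.
     */
    static void disableNegativeCaching() {
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    /**
     * Tight retry loop with no sleeps. With negative caching enabled, every
     * attempt after the first would be answered from the JVM cache until the
     * TTL expires; with the TTL at 0, each attempt reaches the resolver again.
     */
    static InetAddress resolveWithRetries(Resolver resolver, String host, int maxAttempts)
            throws UnknownHostException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return resolver.resolve(host);
            } catch (UnknownHostException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        disableNegativeCaching();
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(resolveWithRetries(InetAddress::getByName, host, 3));
    }
}
```

Passing `InetAddress::getByName` as the resolver gives the real behaviour; a fake resolver makes it possible to check the retry logic without depending on DNS.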
