[ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994257#comment-16994257 ]

Andy Grove commented on SPARK-29640:
------------------------------------

We were finally able to identify the root cause of this, so I'm documenting it 
here in the hope that it helps someone else in the future.

The issue was due to the way routing was set up on our EKS clusters, combined 
with the fact that we were using an NLB rather than an ELB in front of our 
nginx ingress controllers.

Specifically, NLB does not support "hairpinning", as explained in 
[https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html].

In layman's terms: if pod A tries to communicate with pod B, both pods are on 
the same node, and the request egresses from the node and is then routed back 
to that same node via the NLB and nginx ingress controller, then the request 
can never succeed and will time out.

Switching to an ELB resolves the issue, but a better solution is to use 
cluster-local addressing so that communication between pods on the same node 
uses the local network.
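As a concrete sketch of what cluster-local addressing means here (the service and namespace names below are hypothetical, not from this issue), the idea is to address a service by its in-cluster DNS name rather than by an external hostname that routes out through the NLB:

```shell
# Hypothetical service/namespace names, for illustration only.
SERVICE=my-service
NAMESPACE=my-namespace

# Addressing the service via its external ingress hostname sends traffic out
# of the node, through the NLB, and back in -- the hairpin path that fails:
#   https://my-service.example.com/

# Addressing it by its cluster-local DNS name keeps traffic on the cluster
# network; the pattern is <service>.<namespace>.svc.cluster.local:
CLUSTER_LOCAL_URL="http://${SERVICE}.${NAMESPACE}.svc.cluster.local"
echo "$CLUSTER_LOCAL_URL"
```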

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in 
> Spark driver
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29640
>                 URL: https://issues.apache.org/jira/browse/SPARK-29640
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.4
>            Reporter: Andy Grove
>            Priority: Major
>
> We are running into intermittent DNS issues where the Spark driver fails to 
> resolve "kubernetes.default.svc" when trying to create executors. We are 
> running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS.
> This happens approximately 10% of the time.
> Here is the stack trace:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
>       at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
>       at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
>       at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
>       at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
>       at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
>       at scala.Option.getOrElse(Option.scala:121)
>       at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
>       at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
>       at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>       at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>       at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>       at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>       at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>       at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [wf-50000-69674f15d0fc45-1571354060179-driver]  in namespace: [tenant-8-workflows]  failed.
>       at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>       at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
>       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
>       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
>       at scala.Option.map(Option.scala:146)
>       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
>       at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
>       at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
>       ... 20 more
> Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
>       at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>       at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
>       at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
>       at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1193)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>       at okhttp3.Dns$1.lookup(Dns.java:39)
>       at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
>       at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
>       at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
>       at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
>       at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
>       at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
>       at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
>       at okhttp3.RealCall.execute(RealCall.java:69)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:330)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:311)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:810)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:218)
>       ... 27 more  {code}
> This issue seems to be caused by 
> [https://github.com/kubernetes/kubernetes/issues/76790].
> One suggested workaround is to specify TCP mode for DNS lookups in the pod 
> spec 
> ([https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508]).
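> For reference, the workaround in that linked comment amounts to forcing the 
> resolver into TCP mode via the pod spec's dnsConfig. A minimal sketch, 
> assuming a glibc-based resolver that honors the use-vc resolv.conf option:
> {code:yaml}
> spec:
>   dnsConfig:
>     options:
>       # use-vc makes glibc resolvers use TCP for DNS lookups,
>       # avoiding the UDP conntrack race in kubernetes#76790.
>       - name: use-vc
> {code}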
> I would like the ability to pass a flag to spark-submit that specifies TCP 
> mode for DNS lookups.
> I am working on a PR for this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
