[ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127234#comment-17127234 ]

Prabhakar commented on SPARK-29640:
-----------------------------------

Is there a way to explicitly configure the Kubernetes API server URL? If so, specifying the fully qualified DNS name (e.g. kubernetes.default.svc.cluster.local) might help.
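
For context, Spark takes the API server endpoint from the master URL passed to spark-submit, so one way to try this is to pass the fully qualified name there. This is only a sketch: the port, container image, and jar path below are placeholders, and some Spark versions hard-code the driver's in-cluster URL to https://kubernetes.default.svc, in which case the master URL would not take effect inside the driver.

{code:bash}
# Sketch: point Spark at the fully qualified API server name instead of
# the short "kubernetes.default.svc" form. Image and jar are placeholders.
spark-submit \
  --master k8s://https://kubernetes.default.svc.cluster.local:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
{code}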

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in 
> Spark driver
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29640
>                 URL: https://issues.apache.org/jira/browse/SPARK-29640
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Andy Grove
>            Priority: Major
>
> We are running into intermittent DNS issues where the Spark driver fails to 
> resolve "kubernetes.default.svc" when trying to create executors. We are 
> running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS.
> This happens approximately 10% of the time.
> Here is the stack trace:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
>       at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
>       at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
>       at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
>       at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
>       at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
>       at scala.Option.getOrElse(Option.scala:121)
>       at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
>       at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
>       at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>       at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>       at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>       at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>       at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>       at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [wf-50000-69674f15d0fc45-1571354060179-driver]  in namespace: [tenant-8-workflows]  failed.
>       at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>       at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
>       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
>       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
>       at scala.Option.map(Option.scala:146)
>       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
>       at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
>       at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
>       ... 20 more
> Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
>       at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>       at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
>       at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
>       at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1193)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>       at okhttp3.Dns$1.lookup(Dns.java:39)
>       at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
>       at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
>       at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
>       at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
>       at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
>       at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
>       at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
>       at okhttp3.RealCall.execute(RealCall.java:69)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:330)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:311)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:810)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:218)
>       ... 27 more  {code}
> This issue seems to be caused by 
> [https://github.com/kubernetes/kubernetes/issues/76790]
> One suggested workaround is to specify TCP mode for DNS lookups in the pod 
> spec 
> ([https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508]).
> I would like the ability to pass a flag to spark-submit that tells the driver 
> to use TCP mode for DNS lookups.
> I am working on a PR for this.
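
For reference, the TCP-mode DNS workaround mentioned in the quoted issue is expressed in the pod spec via a dnsConfig stanza. The snippet below is only a sketch: the pod name and image are placeholders, and since Spark 2.4 has no built-in way to inject dnsConfig into driver/executor pods, it would have to be applied by something like a mutating admission webhook.

{code:yaml}
# Sketch of the dnsConfig workaround: the glibc resolver option "use-vc"
# (glibc >= 2.14) sends DNS queries over TCP instead of UDP, sidestepping
# the UDP conntrack race described in the linked Kubernetes issues.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver            # placeholder
spec:
  dnsConfig:
    options:
      - name: use-vc            # force TCP for DNS lookups
  containers:
    - name: spark-kubernetes-driver
      image: my-spark:2.4.4     # placeholder
{code}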



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
