[ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127234#comment-17127234 ]
Prabhakar commented on SPARK-29640:
-----------------------------------

Is there a way to explicitly configure the Kubernetes API server URL? If so, specifying the complete DNS name, e.g. kubernetes.default.svc.cluster.local, might help.

> [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29640
>                 URL: https://issues.apache.org/jira/browse/SPARK-29640
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Andy Grove
>            Priority: Major
>
> We are running into intermittent DNS issues where the Spark driver fails to resolve "kubernetes.default.svc" when trying to create executors. We are running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS.
> This happens approximately 10% of the time.
> Here is the stack trace:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
>       at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
>       at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
>       at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
>       at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
>       at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
>       at scala.Option.getOrElse(Option.scala:121)
>       at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
>       at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
>       at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>       at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>       at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>       at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>       at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>       at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [wf-50000-69674f15d0fc45-1571354060179-driver] in namespace: [tenant-8-workflows] failed.
>       at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>       at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
>       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
>       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
>       at scala.Option.map(Option.scala:146)
>       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
>       at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
>       at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
>       ... 20 more
> Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
>       at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>       at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
>       at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
>       at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1193)
>       at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>       at okhttp3.Dns$1.lookup(Dns.java:39)
>       at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
>       at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
>       at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
>       at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
>       at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
>       at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
>       at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>       at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>       at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
>       at okhttp3.RealCall.execute(RealCall.java:69)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:330)
>       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:311)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:810)
>       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:218)
>       ... 27 more
> {code}
> This issue seems to be caused by [https://github.com/kubernetes/kubernetes/issues/76790]
> One suggested workaround is to specify TCP mode for DNS lookups in the pod spec ([https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508]).
> I would like to be able to pass a flag to spark-submit that enables TCP mode for DNS lookups.
> I am working on a PR for this.
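
To check from inside a driver pod whether the short name or the fully qualified name resolves more reliably, a small probe along the following lines may help. This is only a diagnostic sketch, not what Spark or the fabric8 client does internally: the class name DnsRetry, the Resolver interface, and the attempt and backoff values are all made up for illustration. Note also that the JVM caches failed lookups for a short time by default (security property networkaddress.cache.negative.ttl), so retries spaced closer together than that TTL may simply return the cached failure.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration; not part of Spark or fabric8.
final class DnsRetry {

    /** Pluggable lookup so the retry logic can be exercised without real DNS. */
    @FunctionalInterface
    public interface Resolver {
        InetAddress[] lookup(String host) throws UnknownHostException;
    }

    /**
     * Tries to resolve the host up to maxAttempts times, sleeping
     * backoffMillis * attemptNumber between failed attempts.
     * Rethrows the last UnknownHostException if every attempt fails.
     */
    public static InetAddress[] resolveWithRetry(
            String host, int maxAttempts, long backoffMillis, Resolver resolver)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return resolver.lookup(host);
            } catch (UnknownHostException e) {
                last = e; // e.g. "kubernetes.default.svc: Try again"
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Default to the fully qualified name suggested in the comment above.
        String host = args.length > 0 ? args[0] : "kubernetes.default.svc.cluster.local";
        // Assumed values: 5 attempts, 200 ms base backoff.
        InetAddress[] addrs = resolveWithRetry(host, 5, 200L, InetAddress::getAllByName);
        for (InetAddress a : addrs) {
            System.out.println(host + " -> " + a.getHostAddress());
        }
    }
}
```

Running it once with kubernetes.default.svc and once with kubernetes.default.svc.cluster.local as the argument would show whether the fully qualified name avoids the intermittent failure in a given cluster.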