Ankit Prakash Gupta created SPARK-38050:
-------------------------------------------
Summary: Spark Job fails intermittently as the connection times
out while checking the driver pod's status during the creation of
ExecutorPodsAllocator.
Key: SPARK-38050
URL: https://issues.apache.org/jira/browse/SPARK-38050
Project: Spark
Issue Type: Improvement
Components: Kubernetes
Affects Versions: 3.2.0
Reporter: Ankit Prakash Gupta
During creation of the SchedulerBackend for the Kubernetes resource manager,
the ExecutorPodsAllocator constructor checks whether the Spark driver pod
exists by sending a GET request to the Kubernetes master for the pod's status.
This request often times out while connecting, which makes the Spark job fail.

A simple retry would be very helpful here: the timeouts occur when the k8s
master is under heavy load, so a retried request might connect and return the
driver's Pod object.
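The retry proposed above could look roughly like the sketch below. It is a
generic helper with exponential backoff, not Spark's or fabric8's actual code:
the class name DriverPodLookupRetry, the method withRetry, and the parameters
maxAttempts and initialDelayMs are hypothetical names chosen for illustration.
The lookup argument stands in for the Pod [get] call seen in the stack trace.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries a lookup with exponential backoff.
final class DriverPodLookupRetry {
    private DriverPodLookupRetry() {}

    // Calls `lookup` up to `maxAttempts` times. After each failed attempt
    // except the last, sleeps and then doubles the delay. If every attempt
    // fails, rethrows the exception from the final attempt.
    static <T> T withRetry(Callable<T> lookup, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return lookup.call();
            } catch (InterruptedException e) {
                // Do not retry when the calling thread was interrupted.
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                    delayMs *= 2;
                }
            }
        }
        throw last;
    }
}
```

This sketch retries on any exception to stay short. A real change would
probably retry only when the failure is a connection problem such as the
SocketTimeoutException in the trace, and would make the attempt count and
delay configurable.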
{code:java}
22/01/27 01:00:35 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2979)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:559)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
    at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
    ...
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [<driver-pod-name>] in namespace: [<namespace>] failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:226)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:187)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:86)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$driverPod$1(ExecutorPodsAllocator.scala:79)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:78)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:118)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2973)
    ... 25 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:607)
    at okhttp3.internal.platform.Platform.connectSocket(Platform.java:129)
    at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:247)
    at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:167)
    at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
    at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
    at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:133)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.TokenRefreshInterceptor.intercept(TokenRefreshInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createApplicableInterceptors$6(HttpClientUtils.java:284)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
    at okhttp3.RealCall.execute(RealCall.java:93)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:541)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:504)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:471)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:453)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:947)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:221)
    ... 32 more
{code}