Ankit Prakash Gupta created SPARK-38050:
-------------------------------------------

             Summary: Spark Job fails intermittently as the connection times 
out while checking the driver pod's status during the creation of 
ExecutorPodsAllocator.
                 Key: SPARK-38050
                 URL: https://issues.apache.org/jira/browse/SPARK-38050
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: 3.2.0
            Reporter: Ankit Prakash Gupta


While the SchedulerBackend for the Kubernetes resource manager is being 
created, the ExecutorPodsAllocator constructor checks whether the Spark driver 
pod exists by sending an HTTP GET request to the K8s master for the pod's 
status. This request often times out, which makes the Spark job fail. 

As a solution, a simple retry would be very helpful: the first attempt can time 
out while the K8s master is under heavy load, but a retry may still be able to 
connect and get the driver's Pod object.
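
To make the suggestion concrete, here is a minimal sketch of the kind of retry 
I have in mind. It is not the existing Spark or fabric8 code: the class name 
RetrySketch, the method withRetry, and the parameters maxAttempts and 
initialBackoffMs are hypothetical, and retrying only when a 
java.net.SocketTimeoutException appears in the cause chain is an assumption 
based on the trace below.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only; not part of Spark or fabric8.
final class RetrySketch {
    private RetrySketch() {}

    // Runs `call` up to maxAttempts times. Between attempts it sleeps,
    // doubling the delay each time, starting from initialBackoffMs.
    // Only failures caused by a socket timeout are retried; any other
    // exception is rethrown immediately.
    static <T> T withRetry(Callable<T> call, int maxAttempts, long initialBackoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!isTimeout(e)) {
                    throw e;
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                    backoffMs *= 2;
                }
            }
        }
        // All attempts timed out: surface the last failure to the caller.
        throw last;
    }

    // True if a SocketTimeoutException appears anywhere in the cause chain,
    // as it does under KubernetesClientException in the trace below.
    static boolean isTimeout(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SocketTimeoutException) {
                return true;
            }
        }
        return false;
    }
}
```

The driver pod lookup in ExecutorPodsAllocator could then be wrapped along the 
lines of RetrySketch.withRetry(() -> <existing pod get call>, 3, 1000L), where 
the attempt count and delay are placeholders that would presumably come from 
configuration.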

 

 
{code:java}
22/01/27 01:00:35 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2979)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:559)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
        at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
        ...
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [<driver-pod-name>]  in namespace: [<namespace>]  failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:226)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:187)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:86)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$driverPod$1(ExecutorPodsAllocator.scala:79)
        at scala.Option.map(Option.scala:230)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:78)
        at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:118)
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2973)
        ... 25 more
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:607)
        at okhttp3.internal.platform.Platform.connectSocket(Platform.java:129)
        at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:247)
        at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:167)
        at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
        at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
        at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:133)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.TokenRefreshInterceptor.intercept(TokenRefreshInterceptor.java:42)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createApplicableInterceptors$6(HttpClientUtils.java:284)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
        at okhttp3.RealCall.execute(RealCall.java:93)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:541)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:504)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:471)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:453)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:947)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:221)
        ... 32 more
{code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
