Hi Dongjoon,
Thanks for replying and clarifying.
Below are the errors in Spark 3.2 on K8s which occurred because of a
timeout.
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]
for kind: [ConfigMap] with name: [null] in namespace: [xyz] failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.setUpExecutorConfigMap(KubernetesClusterSchedulerBackend.scala:80)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:103)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]
for kind: [Pod] with name: [null] in namespace: [xyz] failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:400)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)
Both of the errors above are caused by a timeout:
Caused by: java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:232)
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
    at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
Setting *spark.kubernetes.driver.connectionTimeout* and
*spark.kubernetes.submission.connectionTimeout* to higher values made this
work. Since spark.network.timeout was already set, I was wondering why this
timeout has to be set separately, but your explanation helps me understand
things better.
As you have suggested, IMO adding a better error message in case of a K8s
timeout would improve debuggability for the user.
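For anyone hitting the same thing, below is a sketch of how these options
can be passed at submission. The 60000 ms values, the API server address,
and the app jar path are illustrative placeholders, not the exact values I
used; the two requestTimeout settings are the related read-timeout knobs of
the same fabric8 client and may also matter since the stack trace shows a
read timeout:

```shell
# Raise the fabric8 K8s client timeouts (all values are in milliseconds;
# 60000 is an illustrative choice, tune for your cluster).
spark-submit \
  --master k8s://https://<api-server-host>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.connectionTimeout=60000 \
  --conf spark.kubernetes.driver.requestTimeout=60000 \
  --conf spark.kubernetes.submission.connectionTimeout=60000 \
  --conf spark.kubernetes.submission.requestTimeout=60000 \
  local:///path/to/app.jar
```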
On Tue, Aug 2, 2022 at 3:55 AM Dongjoon Hyun
wrote:
> Hi, Pralabh.
>
> Could you elaborate on your situation more? I'm interested in your needs.
>
> Currently, the default value of spark.network.timeout, 120s, is quite
> bigger than the default value of
> spark.kubernetes.driver.connectionTimeout, 10s. It would be a breaking
> change if we increase `spark.kubernetes.driver.connectionTimeout` to
> `120s` blindly in the next release.
>
> In addition, I don't think it's a good idea to adjust
> `spark.network.timeout` for K8s control plane timeout issues.
> `spark.network.timeout` has already many other side-effects in Spark
> operation itself. I'd recommend having more directional error messages
> to guide those novice users in that situation instead.
>
> Lastly, the most expensive API call is polling the executor status. To
> reduce the overhead of K8s server side and mitigate the root cause,
> Apache Spark 3.3.0 allows K8s API server-side caching via SPARK-36334.
> You may want to try the follo