Re: Setting spark.kubernetes.driver.connectionTimeout, spark.kubernetes.submission.connectionTimeout to default spark.network.timeout

2022-08-02 Thread Pralabh Kumar
Hi Dongjoon

Thanks for replying and clarifying.

Below are the errors in Spark 3.2 on K8s that occurred because of a
timeout.

io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [ConfigMap] with name: [null] in namespace: [xyz] failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.setUpExecutorConfigMap(KubernetesClusterSchedulerBackend.scala:80)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:103)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)





io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [xyz] failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:400)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)







Both of the above errors occur because of a timeout.



Caused by: java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:232)
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
    at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)





Setting spark.kubernetes.driver.connectionTimeout and
spark.kubernetes.submission.connectionTimeout to higher values made this
work. Since spark.network.timeout was already set, I was wondering why
this timeout needs to be set separately. But your explanation helps me
understand things better.



As you have suggested, IMO adding a better error message in the case of a
K8s timeout would improve user debuggability.
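As a sketch of what such a directional error message could look like (a hypothetical helper, not Spark's actual code), one could wrap the K8s client call and re-raise the low-level timeout with a hint pointing at the relevant settings:

```python
# Hypothetical sketch (not Spark's actual code): wrap a K8s API call so that
# a low-level socket timeout surfaces a hint about the Spark settings that
# control the K8s client timeout.
import socket

K8S_TIMEOUT_HINT = (
    "K8s API request timed out. Consider increasing "
    "spark.kubernetes.driver.connectionTimeout and "
    "spark.kubernetes.submission.connectionTimeout (milliseconds)."
)

def with_timeout_hint(call):
    """Run call(), re-raising socket timeouts with a directional message."""
    try:
        return call()
    except socket.timeout as exc:
        raise RuntimeError(K8S_TIMEOUT_HINT) from exc

def flaky_create_configmap():
    # Stand-in for the fabric8 create() call that timed out in the trace above.
    raise socket.timeout("timeout")

try:
    with_timeout_hint(flaky_create_configmap)
except RuntimeError as exc:
    print(exc)  # prints the hint message above
```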


On Tue, Aug 2, 2022 at 3:55 AM Dongjoon Hyun 
wrote:


Re: Setting spark.kubernetes.driver.connectionTimeout, spark.kubernetes.submission.connectionTimeout to default spark.network.timeout

2022-08-01 Thread Dongjoon Hyun
Hi, Pralabh.

Could you elaborate on your situation more? I'm interested in your needs.

Currently, the default value of spark.network.timeout, 120s, is much
larger than the default value of
spark.kubernetes.driver.connectionTimeout, 10s. It would be a breaking
change if we increased `spark.kubernetes.driver.connectionTimeout` to
`120s` blindly in the next release.

In addition, I don't think it's a good idea to adjust
`spark.network.timeout` for K8s control plane timeout issues.
`spark.network.timeout` already has many other side effects on Spark
operation itself. I'd recommend having more directional error messages
to guide novice users in that situation instead.

Lastly, the most expensive API call is polling the executor status. To
reduce the overhead of K8s server side and mitigate the root cause,
Apache Spark 3.3.0 allows K8s API server-side caching via SPARK-36334.
You may want to try the following configuration if you have very
limited control plane resources.

spark.kubernetes.executor.enablePollingWithResourceVersion=true

Dongjoon.

On Mon, Aug 1, 2022 at 7:52 AM Pralabh Kumar  wrote:
>
> Hi Dev team
>
>
>
> Since spark.network.timeout is the default for all network transactions,
> shouldn't spark.kubernetes.driver.connectionTimeout and
> spark.kubernetes.submission.connectionTimeout default to
> spark.network.timeout?
>
> Users migrating from YARN to K8s are familiar with spark.network.timeout,
> and if a timeout occurs on K8s, they need to explicitly set the above two
> properties. If those properties defaulted to spark.network.timeout, users
> would not need to set them explicitly, and things would work with
> spark.network.timeout alone.
>
>
>
> Please let me know if my understanding is correct.
>
>
>
> Regards
>
> Pralabh Kumar
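The fallback proposed above could behave like this small sketch (hypothetical helper names; Spark's real config resolution lives in its internal ConfigBuilder machinery): an explicitly set K8s connection timeout wins, otherwise fall back to spark.network.timeout converted to milliseconds, otherwise the current 10000 ms default.

```python
# Hypothetical sketch of the proposed fallback: resolve the effective K8s
# connection timeout (in ms) from an explicit setting, then from
# spark.network.timeout, then from the current 10000 ms default.

DEFAULT_K8S_CONNECTION_TIMEOUT_MS = 10000  # current Spark default (10s)

def _seconds_to_ms(value: str) -> int:
    # spark.network.timeout is typically written like "120s".
    return int(value.rstrip("s")) * 1000

def effective_k8s_connection_timeout_ms(conf: dict) -> int:
    explicit = conf.get("spark.kubernetes.driver.connectionTimeout")
    if explicit is not None:
        return int(explicit)  # explicit setting wins
    network = conf.get("spark.network.timeout")
    if network is not None:
        return _seconds_to_ms(network)  # proposed fallback
    return DEFAULT_K8S_CONNECTION_TIMEOUT_MS

print(effective_k8s_connection_timeout_ms({"spark.network.timeout": "120s"}))  # 120000
print(effective_k8s_connection_timeout_ms({}))  # 10000
```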

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org