Re: Spark 2.3 error on Kubernetes

2018-05-29 Thread Anirudh Ramanathan
Interesting.
Perhaps you could try resolving service addresses from within a pod and
seeing if there's some other issue causing intermittent failures in
resolution.
The steps here
<https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/#getting-a-shell-to-a-container>
may be helpful.
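
As a rough sketch of that kind of check (the pod name is a placeholder, the "spark" namespace is taken from the service name in the error below, and this assumes the image has a shell and nslookup available):

# Get a shell in a pod running in the same namespace as the driver
kubectl exec -it <some-pod> -n spark -- /bin/sh

# From inside the pod, try resolving the headless driver service a few times
nslookup spark-1527629824987-driver-svc.spark.svc

# Also check which DNS server the pod is pointed at and that cluster DNS works
cat /etc/resolv.conf
nslookup kubernetes.default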

On Tue, May 29, 2018 at 4:02 PM, purna pradeep wrote:

> Anirudh,
>
> Thanks for your response
>
> I’m running a k8s cluster on AWS and the kube-dns pods are running fine. Also,
> as I mentioned, only 1 executor pod is running though I requested 5; the
> other 4 were killed with the below error, and I do have enough resources
> available.
>
> On Tue, May 29, 2018 at 6:28 PM Anirudh Ramanathan wrote:
>
>> This looks to me like a kube-dns error that's causing the driver DNS
>> address to not resolve.
>> It would be worth double checking that kube-dns is indeed running (in the
>> kube-system namespace).
>> Often, with environments like minikube, kube-dns may exit/crashloop due
>> to lack of resources.
>>
>> On Tue, May 29, 2018 at 3:18 PM, purna pradeep wrote:
>>
>>> Hello,
>>>
>>> I’m getting the below error when I spark-submit a Spark 2.3 app on
>>> Kubernetes *v1.8.3*; some of the executor pods were killed with the error
>>> below as soon as they came up:
>>>
>>> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
>>>         at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
>>>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>>>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
>>>         at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>>> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
>>>         at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>>>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>>>         at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
>>>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
>>>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
>>>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>>>         ... 4 more
>>> Caused by: java.io.IOException: Failed to connect to spark-1527629824987-driver-svc.spark.svc:7078
>>>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
>>>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>>>         at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
>>>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
>>>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

Re: Spark 2.3 error on Kubernetes

2018-05-29 Thread Anirudh Ramanathan
This looks to me like a kube-dns error that's causing the driver DNS
address to not resolve.
It would be worth double checking that kube-dns is indeed running (in the
kube-system namespace).
Often, with environments like minikube, kube-dns may exit/crashloop due to
lack of resources.
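
A quick way to double-check that (this assumes the standard kube-dns add-on; the label and container name may differ if your cluster runs CoreDNS or another DNS provider):

# Are the DNS pods running, and are they restarting?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# If they are crash-looping, inspect recent events and the kubedns container logs
kubectl describe pod -n kube-system <kube-dns-pod-name>
kubectl logs -n kube-system <kube-dns-pod-name> -c kubedns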

On Tue, May 29, 2018 at 3:18 PM, purna pradeep wrote:

> Hello,
>
> I’m getting the below error when I spark-submit a Spark 2.3 app on Kubernetes
> *v1.8.3*; some of the executor pods were killed with the error below as soon
> as they came up:
>
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
>         at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
>         at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>         at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
>         at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>         ... 4 more
> Caused by: java.io.IOException: Failed to connect to spark-1527629824987-driver-svc.spark.svc:7078
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>         at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: spark-1527629824987-driver-svc.spark.svc
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>         at java.net.InetAddress.getByName(InetAddress.java:1076)
>         at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
>         at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
>         at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
>         at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
>         at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
>         at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
>         at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
>         at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
>         at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
>         at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
>         at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
>         at io.netty.bootstrap.Bootstrap$

Re: Spark driver pod garbage collection

2018-05-23 Thread Anirudh Ramanathan
There's a flag on the controller manager that controls the retention
policy for terminated or completed pods.

https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/#options
--terminated-pod-gc-threshold int32 Default: 12500
Number of terminated pods that can exist before the terminated pod garbage
collector starts deleting terminated pods. If <= 0, the terminated pod
garbage collector is disabled.
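
As a sketch of how that might be applied on a self-managed cluster where kube-controller-manager runs as a static pod (the manifest path and the component label below follow the usual kubeadm conventions and are assumptions; managed offerings often don't expose this component at all):

# Edit the controller-manager manifest; the kubelet restarts the pod on change
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
# and add a flag to the container command, e.g.:
#   - --terminated-pod-gc-threshold=100

# Verify the running controller-manager picked the flag up
kubectl -n kube-system get pod -l component=kube-controller-manager \
  -o jsonpath='{.items[0].spec.containers[0].command}'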

On Wed, May 23, 2018, 8:34 AM purna pradeep  wrote:

> Hello,
>
> Currently I observe that dead pods are not getting garbage collected (i.e.,
> Spark driver pods which have completed execution). So pods could potentially
> sit in the namespace for weeks. This makes listing, parsing, and reading pods
> slower, as well as leaving junk sitting on the cluster.
>
> I believe the minimum-container-ttl-duration kubelet flag is set to 0 minutes
> by default, but I don’t see the completed Spark driver pods being garbage
> collected.
>
> Do I need to set any flag explicitly at the kubelet level?
>
>


Re: Spark driver pod eviction Kubernetes

2018-05-22 Thread Anirudh Ramanathan
I think a pod disruption budget might actually work here. It can select the
Spark driver pod using a label, and using that with an appropriate
minAvailable value could do it.
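
Something like the following minimal sketch could express that (the spark-role=driver label and the "spark" namespace are assumptions; check the actual labels on your driver pod with kubectl get pod <driver-pod> --show-labels):

cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: spark-driver-pdb
  namespace: spark
spec:
  # Keep the single matched driver pod available, so voluntary evictions are blocked
  minAvailable: 1
  selector:
    matchLabels:
      spark-role: driver
EOF

With that in place, a kubectl drain run for node maintenance should refuse to evict the matched driver pod rather than kill it mid-job.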

In a more general sense, we do plan on future work to support driver
recovery, which should help long-running jobs restart without losing
progress.

On Tue, May 22, 2018, 7:55 AM purna pradeep  wrote:

> Hi,
>
> What would be the recommended approach to wait for the Spark driver pod to
> complete the currently running job before it gets evicted to a new node
> while maintenance on the current node is going on (kernel upgrade, hardware
> maintenance, etc.) using the drain command?
>
> I don’t think I can use PodDisruptionBudget, as the Spark pods' deployment
> yaml(s) is taken care of by Kubernetes.
>
> Please suggest!
>
>
>


Re: Structured Streaming on Kubernetes

2018-04-13 Thread Anirudh Ramanathan
+ozzieba, who was experimenting with streaming workloads recently. +1 to
what Matt said. Checkpointing and driver recovery are future work.
Structured streaming is important, and it would be good to get some
production experience here and target improving the feature's support on
K8s for 2.4/3.0.


On Fri, Apr 13, 2018 at 11:55 AM Matt Cheah <mch...@palantir.com> wrote:

> We don’t provide any Kubernetes-specific mechanisms for streaming, such as
> checkpointing to persistent volumes. But as long as streaming doesn’t
> require persisting to the executor’s local disk, streaming ought to work
> out of the box. E.g. you can checkpoint to HDFS, but not to the pod’s local
> directories.
>
>
>
> However, I’m unaware of any specific use of streaming with the Spark on
> Kubernetes integration right now. Would be curious to get feedback on the
> failover behavior right now.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Tathagata Das <t...@databricks.com>
> *Date: *Friday, April 13, 2018 at 1:27 AM
> *To: *Krishna Kalyan <krishnakaly...@gmail.com>
> *Cc: *user <user@spark.apache.org>
> *Subject: *Re: Structured Streaming on Kubernetes
>
>
>
> Structured streaming is stable in production! At Databricks, we and our
> customers collectively process almost 100s of billions of records per day
> using SS. However, we are not using kubernetes :)
>
>
>
> Though I don't think it will matter too much as long as kubes are
> correctly provisioned+configured and you are checkpointing to HDFS (for
> fault-tolerance guarantees).
>
>
>
> TD
>
>
>
> On Fri, Apr 13, 2018, 12:28 AM Krishna Kalyan <krishnakaly...@gmail.com>
> wrote:
>
> Hello All,
>
> We were evaluating Spark Structured Streaming on Kubernetes (Running on
> GCP). It would be awesome if the spark community could share their
> experience around this. I would like to know more about your production
> experience and the monitoring tools you are using.
>
>
>
> Since spark on kubernetes is a relatively new addition to spark, I was
> wondering if structured streaming is stable in production. We were also
> evaluating Apache Beam with Flink.
>
>
>
> Regards,
>
> Krishna
>
>
>
>
>
>

-- 
Anirudh Ramanathan


Re: Spark Kubernetes Volumes

2018-04-12 Thread Anirudh Ramanathan
There's a JIRA SPARK-23529
<https://issues.apache.org/jira/browse/SPARK-23529> that deals with
mounting hostPath volumes.
I propose we extend that PR/JIRA to encompass all the different volume
types and allow mounting them into the driver/executors.

On Thu, Apr 12, 2018 at 10:55 AM Yinan Li <liyinan...@gmail.com> wrote:

> Hi Marius,
>
> Spark on Kubernetes does not yet support mounting user-specified volumes
> natively. But mounting volume is supported in
> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator. Please see
> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#mounting-volumes
> .
>
> On Thu, Apr 12, 2018 at 7:50 AM, Marius <m.die0...@gmail.com> wrote:
>
>> Hey,
>>
>> I have a question regarding the Spark on Kubernetes feature. I would like
>> to mount a pre-populated Kubernetes volume into the executor pods of
>> Spark. One of my tools that I invoke using Spark's pipe command requires
>> these files to be available on a POSIX-compatible FS, and they are too large
>> to justify copying them around using addFile. If this is not possible, I
>> would like to know if the community would be interested in such a feature.
>>
>> Cheers
>>
>> Marius
>>
>> ---------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>

-- 
Anirudh Ramanathan


Re: handling Remote dependencies for spark-submit in spark 2.3 with kubernetes

2018-03-08 Thread Anirudh Ramanathan
You don't need to create the init-container; it's an implementation detail.
If you provide a remote URI and
specify spark.kubernetes.container.image=, Spark *internally*
will add the init-container to the pod spec for you.
*If* for some reason you want to customize the init-container image, you
can choose to do that using the specific options, but I don't think this is
necessary in most scenarios. The init-container, driver, and executor
images can be identical by default.
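
For what it's worth, a hedged example of what such a submission might look like (the API server address, image, main class, and S3 paths are placeholders; using s3a:// URIs also assumes the images carry the hadoop-aws/AWS SDK jars and can reach credentials):

# Submit with remote dependencies; Spark adds the init-container itself
bin/spark-submit \
  --master k8s://https://<api-server-host>:<port> \
  --deploy-mode cluster \
  --name my-app \
  --class com.example.MainApp \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=<registry>/spark:2.3.0 \
  --jars s3a://my-bucket/deps/dep1.jar,s3a://my-bucket/deps/dep2.jar \
  --files s3a://my-bucket/conf/app.conf \
  s3a://my-bucket/jars/mainapplication.jar

Spark should then inject the init-container into the driver and executor pods to fetch those remote URIs before the main containers start.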


On Thu, Mar 8, 2018 at 6:52 AM purna pradeep <purna2prad...@gmail.com> wrote:

> I'm trying to run spark-submit to a Kubernetes cluster with the Spark 2.3
> Docker container image.
>
> The challenge I'm facing is that the application has a mainapplication.jar and
> other dependency files & jars which are located in a remote location like AWS
> S3. As per the Spark 2.3 documentation there is something called a Kubernetes
> init-container to download remote dependencies, but in this case I'm not
> creating any pod spec to include init-containers in Kubernetes; as per the
> documentation, Spark 2.3 on Kubernetes internally creates the pods
> (driver, executor). So I'm not sure how I can use the init-container for
> spark-submit when there are remote dependencies.
>
>
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-remote-dependencies
>
> Please suggest
>


-- 
Anirudh Ramanathan