Folks, thanks for all the great input. Responding to various points raised:

 

Marcelo/Yinan/Felix – 

 

Yes, client mode will work.  The main JAR is distributed automatically, and 
dependencies specified via --jars/--files are also distributed, though for 
--files user code needs to use the appropriate Spark API, i.e. 
SparkFiles.get(), to resolve the actual path.
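
As a minimal sketch of that pattern (the lookup.csv file name and the session 
setup are made up for illustration, not taken from this thread), user code 
submitted with --files would resolve the file roughly like this:

    import scala.io.Source
    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    object FilesExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("files-example").getOrCreate()
        val sc = spark.sparkContext

        // Assuming the job was submitted with: --files /local/path/lookup.csv
        // On the driver, SparkFiles.get() resolves the bare file name to the
        // location Spark copied it to, not the original client-side path.
        println(s"Driver copy: ${SparkFiles.get("lookup.csv")}")

        // Inside tasks, the same call resolves to the per-executor copy.
        val firstLines = sc.parallelize(1 to 4, 4).map { _ =>
          val src = Source.fromFile(SparkFiles.get("lookup.csv"))
          try src.getLines().next() finally src.close()
        }.collect()

        firstLines.foreach(println)
        spark.stop()
      }
    }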

 

However, client mode can be awkward if you want to mix spark-submit 
distribution with mounting dependencies via volumes, since you may need to 
ensure that dependencies appear at the same path both on the local submission 
client and when mounted into the executors.  This mainly applies to the case 
where user code does not use SparkFiles.get() and simply tries to access the 
path directly.

 

Marcelo/Stavros – 

 

Yes, I did give the other resource managers too much credit.  From my past 
experience with Mesos and Standalone I had thought this wasn’t an issue, but 
going back and looking at what we did for both of those, it appears we were 
entirely reliant on a shared file system (whether HDFS, NFS or other 
POSIX-compliant filesystems, e.g. Lustre).

 

Since connectivity back to the client is a potential stumbling block for 
cluster mode, I wonder if it would be better to think in reverse, i.e. rather 
than having the driver pull from the client, have the client push to the 
driver pod?

 

You can do this manually yourself via kubectl cp, so it should be possible to 
do this programmatically, since kubectl cp looks to be just a tar piped into a 
kubectl exec.  This would keep the relevant logic in the Kubernetes-specific 
client, which may or may not be desirable depending on whether we’re looking 
to fix this just for K8S or more generally.  Of course there is probably a 
fair bit of complexity in making this work, but does that sound like something 
worth exploring?
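
To make that slightly more concrete, here is a rough sketch (the pod name, 
namespace and paths are entirely hypothetical, and this simply shells out to 
kubectl via scala.sys.process rather than using anything kubectl itself 
exposes) of streaming a tar of a local directory into the driver pod:

    import scala.sys.process._

    object PushToDriverPod {
      // Roughly emulate `kubectl cp`: stream a tar of localDir into the pod
      // and unpack it at remoteDir. Requires tar and kubectl on the
      // submission client and tar inside the driver image.
      def pushDir(pod: String, namespace: String,
                  localDir: String, remoteDir: String): Int = {
        val createTar = Seq("tar", "cf", "-", "-C", localDir, ".")
        val unpackInPod = Seq(
          "kubectl", "exec", "-i", "-n", namespace, pod, "--",
          "tar", "xf", "-", "-C", remoteDir)
        // #| pipes the tar process's stdout into kubectl exec's stdin.
        (createTar #| unpackInPod).!
      }

      def main(args: Array[String]): Unit = {
        // Hypothetical values purely for illustration.
        val exit = pushDir("spark-pi-driver", "spark-jobs",
                           "/tmp/app-deps", "/opt/spark/work-dir")
        println(s"push finished with exit code $exit")
      }
    }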

 

I hadn’t really considered the HA aspect; a first step would be to get the 
basics working and then look at HA.  Although if the above theoretical 
approach is practical, that could simply be part of restarting the driver.

 

Rob

 

 

From: Felix Cheung <felixcheun...@hotmail.com>
Date: Sunday, 7 October 2018 at 23:00
To: Yinan Li <liyinan...@gmail.com>, Stavros Kontopoulos 
<stavros.kontopou...@lightbend.com>
Cc: Rob Vesse <rve...@dotnetrdf.org>, dev <dev@spark.apache.org>
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

 

Jars and libraries only accessible locally at the driver is fairly limited? 
Don’t you want the same on all executors?

 

 

 

From: Yinan Li <liyinan...@gmail.com>
Sent: Friday, October 5, 2018 11:25 AM
To: Stavros Kontopoulos
Cc: rve...@dotnetrdf.org; dev
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes 

 

> Just to be clear: in client mode things work right? (Although I'm not
> really familiar with how client mode works in k8s - never tried it.)

 

If the driver runs on the submission client machine, yes, it should just work. 
If the driver runs in a pod, however, it faces the same problem as in cluster 
mode.

 

Yinan

 

On Fri, Oct 5, 2018 at 11:06 AM Stavros Kontopoulos 
<stavros.kontopou...@lightbend.com> wrote:

@Marcelo is correct. Mesos does not have anything similar; only YARN does, due 
to its distributed cache.

I have described most of the above in the JIRA; there are also some other 
options.

 

Best,

Stavros

 

On Fri, Oct 5, 2018 at 8:28 PM, Marcelo Vanzin <van...@cloudera.com.invalid> 
wrote:

On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse <rve...@dotnetrdf.org> wrote:
> Ideally this would all just be handled automatically for users in the way 
> that all other resource managers do

I think you're giving other resource managers too much credit. In
cluster mode, only YARN really distributes local dependencies, because
YARN has that feature (its distributed cache) and Spark just uses it.

Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
anything similar on the Mesos side.

There are things that could be done; e.g. if you have HDFS you could
do a restricted version of what YARN does (upload files to HDFS, and
change the "spark.jars" and "spark.files" URLs to point to HDFS
instead). Or you could turn the submission client into a file server
that the cluster-mode driver downloads files from - although that
requires connectivity from the driver back to the client.

Neither is great, but better than not having that feature.

Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.)

-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



 
