Agreed with Marcelo that this is not a problem unique to Spark on k8s. For
a lot of organizations, hosting dependencies on HDFS seems to be the
choice. One option, which the Spark Operator
<https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/tree/master/sparkctl>
supports, is to automatically upload application dependencies from the
submission client machine to a user-specified S3 or GCS bucket and
substitute the local dependencies with the remote ones. But regardless of
which option is used to stage local dependencies, this generally only works
for small ones like jars or small config/data files.
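
As an example, with sparkctl the submission looks roughly like this (the
bucket name is made up, and the exact flag may differ between versions):

  sparkctl create my-app.yaml --upload-to gs://my-staging-bucket

sparkctl then rewrites the local dependency URIs in the SparkApplication
spec to point at the uploaded copies before submitting it.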

Yinan

On Fri, Oct 5, 2018 at 10:28 AM Marcelo Vanzin <van...@cloudera.com.invalid>
wrote:

> On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse <rve...@dotnetrdf.org> wrote:
> > Ideally this would all just be handled automatically for users in the
> way that all other resource managers do
>
> I think you're giving other resource managers too much credit. In
> cluster mode, only YARN really distributes local dependencies, because
> YARN has that feature (its distributed cache) and Spark just uses it.
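>
> For comparison, that means in YARN cluster mode a plain local jar just
> works; the submission client uploads it to the distributed cache for you
> (paths and class name here are hypothetical):
>
>   spark-submit --master yarn --deploy-mode cluster \
>     --jars /local/path/extra.jar \
>     --class com.example.Main /local/path/my-app.jar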
>
> Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
> anything similar on the Mesos side.
>
> There are things that could be done; e.g. if you have HDFS you could
> do a restricted version of what YARN does (upload files to HDFS, and
> change the "spark.jars" and "spark.files" URLs to point to HDFS
> instead). Or you could turn the submission client into a file server
> that the cluster-mode driver downloads files from - although that
> requires connectivity from the driver back to the client.
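>
> A minimal sketch of the HDFS variant (paths and class name made up):
>
>   hdfs dfs -put my-app.jar /user/me/staging/my-app.jar
>   spark-submit --master k8s://https://<api-server> --deploy-mode cluster \
>     --class com.example.Main hdfs:///user/me/staging/my-app.jar
>
> The difference from YARN is that the upload and the URL rewriting have to
> be done by the user or some wrapper tooling instead of by Spark itself.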
>
> Neither is great, but better than not having that feature.
>
> Just to be clear: in client mode things work, right? (Although I'm not
> really familiar with how client mode works in k8s - never tried it.)
>
> --
> Marcelo
>
