If we can separate those concerns out, that might make sense in the short
term, IMO.
There are several benefits to reusing spark-submit and spark-class, as you
pointed out previously, so we should be looking to leverage those
irrespective of how we do dependency management,
in the interest of conformance with the other cluster managers.

I like the idea of passing arguments through in a way that doesn't
trigger the dependency management code for now.
In the interest of time for 2.3, if we could target just that (and
revisit the init containers afterwards),
there should be enough time to make the change, test it, and release with
confidence.
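
For concreteness, something along these lines is what I have in mind - just
a sketch, not the actual code in either PR, and the "shadow" key prefix is
made up: before building the spark-submit arguments, move spark.jars /
spark.files into keys spark-submit doesn't inspect, so its download code
never fires, and let the k8s submission code read them back later (which is
also what Marcelo describes below).

    // Hypothetical helper: hide the user's jars/files from spark-submit by
    // renaming the conf keys it looks at; the k8s-specific code can read
    // the values back from the shadow keys and hand them to the init
    // container (or whatever replaces it).
    def hideDependencyConfs(conf: Map[String, String]): Map[String, String] = {
      val depKeys = Seq("spark.jars", "spark.files")
      val shadowed = depKeys.flatMap { key =>
        conf.get(key).map(value => s"spark.kubernetes.internal.$key" -> value)
      }
      (conf -- depKeys) ++ shadowed
    }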

On Wed, Jan 10, 2018 at 3:45 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan
> <ramanath...@google.com> wrote:
> > We can start by getting a PR going perhaps, and start augmenting the
> > integration testing to ensure that there are no surprises - with/without
> > credentials, accessing GCS, S3 etc as well.
> > When we get enough confidence and test coverage, let's merge this in.
> > Does that sound like a reasonable path forward?
>
> I think it's beneficial to separate this into two separate things as
> far as discussion goes:
>
> - using spark-submit: the code should definitely be starting the
> driver using spark-submit, and potentially the executor using
> spark-class.
>
> - separately, we can decide on whether to keep or remove init containers.
>
> Unfortunately, code-wise, those are not separate. If you get rid of
> init containers, my current p.o.c. has most of the needed changes
> (only lightly tested).
>
> But if you keep init containers, you'll need to mess with the
> configuration so that spark-submit never sees spark.jars /
> spark.files, so it doesn't trigger its dependency download code. (YARN
> does something similar, btw.) That will surely mean different changes
> in the current k8s code (which I wanted to double check anyway because
> I remember seeing some oddities related to those configs in the logs).
>
> To comment on one point made by Andrew:
> > there's almost a parallel here with spark.yarn.archive, where that
> configures the cluster (YARN) to do distribution pre-runtime
>
> That's more of a parallel to the docker image; spark.yarn.archive
> points to a jar file with Spark jars in it so that YARN can make Spark
> available to the driver / executors running in the cluster.
>
> Like the docker image, you could include other stuff that is not
> really part of standard Spark in that archive too, or even not have
> Spark at all there, if you want things to just fail. :-)
>
> --
> Marcelo
>



-- 
Anirudh Ramanathan
