Hi,

Thanks all for the replies.
I am adding the Spark dev list as well, as I think this might be an issue that needs to be addressed. The options presented here will get the jars, but they don't help us with dependency conflicts... For example, com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 uses Guava 30, while Spark 3.5.3 uses Guava 14 - the options here will leave the two in conflict. How can one add packages to their Spark image (during the build of the Docker image) without causing unresolved conflicts?

Thanks!
Nimrod

On Tue, Oct 15, 2024 at 6:53 PM Damien Hawes <marley.ha...@gmail.com> wrote:

> Herewith a more fleshed out example.
>
> An example of a *build.gradle.kts* file:
>
> plugins {
>     id("java")
> }
>
> val sparkJarsDir =
>     objects.directoryProperty().convention(layout.buildDirectory.dir("sparkJars"))
>
> repositories {
>     mavenCentral()
> }
>
> val sparkJars: Configuration by configurations.creating {
>     isCanBeResolved = true
>     isCanBeConsumed = false
> }
>
> dependencies {
>     sparkJars("com.fasterxml.jackson.core:jackson-databind:2.18.0")
> }
>
> val copySparkJars by tasks.registering(Copy::class) {
>     group = "build"
>     description = "Copies the appropriate jars to the configured spark jars directory"
>     from(sparkJars)
>     into(sparkJarsDir)
> }
>
> Now, the *Dockerfile*:
>
> FROM spark:3.5.3-scala2.12-java17-ubuntu
>
> USER root
>
> COPY --chown=spark:spark build/sparkJars/* "$SPARK_HOME/jars/"
>
> USER spark
>
>
> Kind regards,
>
> Damien
>
> On Tue, Oct 15, 2024 at 4:19 PM Damien Hawes <marley.ha...@gmail.com> wrote:
>
>> The simplest solution that I have found was to use Gradle (or Maven, if you prefer) and list the dependencies that I want copied to $SPARK_HOME/jars as project dependencies.
>>
>> Summary of steps to follow:
>>
>> 1. Using your favourite build tool, declare a dependency on your required packages.
>> 2. Write your Dockerfile, with or without the Spark binaries inside it.
>> 3. Use your build tool to copy the dependencies to a location that the Docker daemon can access.
>> 4. Copy the dependencies into the correct directory.
>> 5. Ensure those files have the correct permissions.
>>
>> In my opinion, it is pretty easy to do this with Gradle.
>>
>> On Tue, 15 Oct 2024 at 15:28, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am creating a base Spark image that we are using internally.
>>> We need to add some packages to the base image:
>>> spark:3.5.1-scala2.12-java17-python3-r-ubuntu
>>>
>>> Of course I do not want to start Spark with --packages "..." - it is not efficient at all - I would like to add the needed jars to the image.
>>>
>>> Ideally, I would add something to my image that installs the needed packages - something like:
>>>
>>> RUN $SPARK_HOME/bin/add-packages "..."
>>>
>>> But AFAIK there is no such option.
>>>
>>> Other than running Spark to add those packages and then creating the image - or always running Spark with --packages "..." - what can I do?
>>> Is there a way to run just the code that is run by the --packages option, without running Spark, so I can add the needed dependencies to my image?
>>>
>>> I am sure I am not the only one, nor the first, to encounter this...
>>>
>>> Thanks!
>>> Nimrod
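On the last question above (running only the resolution step that --packages performs, without starting Spark): what --packages does is essentially a Maven/Ivy resolve, and a standalone resolver can do the same thing at image-build time. Below is a rough sketch, not something from this thread, using the Coursier CLI (cs); the launcher URL, the example coordinates, and copying every transitive jar straight into $SPARK_HOME/jars are all assumptions to review (you would normally exclude artifacts Spark already ships):

FROM spark:3.5.3-scala2.12-java17-ubuntu

USER root

# Coursier launcher (runs on the JVM already present in the image).
# Verify and pin this URL before relying on it.
ADD https://github.com/coursier/launchers/raw/master/coursier /usr/local/bin/cs

# Resolve the package plus its transitive dependencies and copy the jars in place.
RUN chmod +x /usr/local/bin/cs && \
    cs fetch com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 | \
    xargs -I{} cp {} "$SPARK_HOME/jars/" && \
    chown spark:spark "$SPARK_HOME/jars/"*.jar

USER spark

Resolving this way pulls in the whole transitive graph, so the Guava clash raised at the top of the thread still has to be dealt with, for example by excluding or relocating the offending artifacts.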
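For the Guava conflict itself, one approach (only a sketch, not discussed in the thread) is to shade and relocate the clashing packages before the jar ever reaches $SPARK_HOME/jars. The build.gradle.kts below assumes the Gradle Shadow plugin; the "repackaged.com.google.common" prefix is an arbitrary choice:

plugins {
    id("java")
    id("com.github.johnrengelman.shadow") version "8.1.1"
}

repositories {
    mavenCentral()
}

dependencies {
    // The connector that drags in the newer Guava.
    implementation("com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0")
}

tasks.shadowJar {
    archiveClassifier.set("relocated")
    // Move the connector's Guava out of com.google.common so it cannot
    // collide with the Guava 14 that Spark ships in $SPARK_HOME/jars.
    relocate("com.google.common", "repackaged.com.google.common")
}

The resulting *-relocated.jar under build/libs bundles the connector and its dependencies with Guava moved aside, and can be copied into the image just like the jars in the earlier example. The gcs-connector also publishes a "shaded" classifier on Maven Central that may already do this for you; worth checking before building your own.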