Hi,

Thanks all for the replies.

I am adding the Spark dev list as well - as I think this might be an issue
that needs to be addressed.

The options presented here will get the jars - but they don't help with
dependency conflicts...
For example, com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 depends
on Guava 30 while Spark 3.5.3 ships with Guava 14 - so the options here leave
both versions on the classpath, conflicting with each other.
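
The closest workaround I have found is to shade the conflicting dependency
myself before it goes into the image, and copy the resulting shadow jar in
instead of the original - a rough sketch with the Gradle Shadow plugin (the
plugin version and the "repackaged" prefix are just placeholders):

plugins {
    id("java")
    id("com.github.johnrengelman.shadow") version "8.1.1"
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0")
}

tasks.shadowJar {
    // Relocate Guava inside the fat jar so it cannot clash with the
    // Guava 14 that Spark ships with.
    relocate("com.google.common", "repackaged.com.google.common")
}

(I believe the GCS connector also publishes a "shaded" classifier artifact
that already relocates its dependencies - but that doesn't generalise to
arbitrary packages.)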

Short of shading everything by hand, how can one add packages to their Spark
image (during the Docker build) without causing unresolved conflicts?

Thanks!
Nimrod


On Tue, Oct 15, 2024 at 6:53 PM Damien Hawes <marley.ha...@gmail.com> wrote:

> Herewith a more fleshed-out example:
>
> An example of a *build.gradle.kts* file:
>
> plugins {
>     id("java")
> }
>
> // Staging directory for the jars that will end up in the image.
> val sparkJarsDir =
>     objects.directoryProperty().convention(layout.buildDirectory.dir("sparkJars"))
>
> repositories {
>     mavenCentral()
> }
>
> // A dedicated, resolve-only configuration for the jars destined for
> // $SPARK_HOME/jars.
> val sparkJars: Configuration by configurations.creating {
>     isCanBeResolved = true
>     isCanBeConsumed = false
> }
>
> dependencies {
>     sparkJars("com.fasterxml.jackson.core:jackson-databind:2.18.0")
> }
>
> val copySparkJars by tasks.registering(Copy::class) {
>     group = "build"
>     description = "Copies the appropriate jars to the configured spark jars directory"
>     from(sparkJars)
>     into(sparkJarsDir)
> }
>
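> Running *./gradlew copySparkJars* resolves the *sparkJars* configuration,
> including its transitive dependencies, and stages the jars under
> *build/sparkJars*, where the Docker build below picks them up.
>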
> Now, the *Dockerfile*:
>
> FROM spark:3.5.3-scala2.12-java17-ubuntu
>
> # Switch to root so we can write into $SPARK_HOME/jars.
> USER root
>
> COPY --chown=spark:spark build/sparkJars/* "$SPARK_HOME/jars/"
>
> # Drop back to the unprivileged user the base image runs as.
> USER spark
>
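> One extra guard that can be useful here: have the *sparkJars* configuration
> fail the build when two of the declared packages disagree on a version,
> instead of letting Gradle silently pick a winner. A sketch (note this only
> catches conflicts among the jars you add, not conflicts with the jars
> already present in the Spark image):
>
> val sparkJars: Configuration by configurations.creating {
>     isCanBeResolved = true
>     isCanBeConsumed = false
>     resolutionStrategy {
>         // Fail resolution outright when two modules in this configuration
>         // pull in different versions of the same dependency.
>         failOnVersionConflict()
>     }
> }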
>
> Kind regards,
>
> Damien
>
> On Tue, Oct 15, 2024 at 4:19 PM Damien Hawes <marley.ha...@gmail.com>
> wrote:
>
>> The simplest solution I have found is to use Gradle (or Maven, if you
>> prefer) and to list the dependencies I want copied to $SPARK_HOME/jars as
>> project dependencies.
>>
>> Summary of steps to follow:
>>
>> 1. Using your favourite build tool, declare a dependency on your required
>> packages.
>> 2. Write your Dockerfile, with or without the Spark binaries inside it.
>> 3. Use your build tool to copy the dependencies to a location that the
>> Docker daemon can access.
>> 4. Copy the dependencies into the correct directory.
>> 5. Ensure those files have the correct permissions.
>>
>> In my opinion, it is pretty easy to do this with Gradle.
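>>
>> For step 5, note that Docker's COPY instruction supports a --chown flag
>> (e.g. COPY --chown=spark:spark ...), which sets ownership in the same layer
>> and avoids a separate chown step.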
>>
>> On Tue, Oct 15, 2024 at 3:28 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am creating a base Spark image that we are using internally.
>>> We need to add some packages to the base image:
>>> spark:3.5.1-scala2.12-java17-python3-r-ubuntu
>>>
>>> Of course I do not want to start Spark with --packages "..." - that
>>> resolves and downloads the dependencies on every startup, which is not
>>> efficient at all - I would rather bake the needed jars into the image.
>>>
>>> Ideally, I would add to my image something that installs the needed
>>> packages - something like:
>>>
>>> RUN $SPARK_HOME/bin/add-packages "..."
>>>
>>> But AFAIK there is no such option.
>>>
>>> Other than running Spark once to fetch those packages and then creating
>>> the image - or always running Spark with --packages "..." - what can I do?
>>> Is there a way to run just the dependency-resolution step behind --packages,
>>> without starting Spark, so I can add the needed jars to my image?
>>>
>>> I am sure I am not the only one, nor the first, to encounter this...
>>>
>>> Thanks!
>>> Nimrod
>>>
>>>
>>>
>>
