Hi,

As the maintainer of the derived image, it's on you to ensure that the
dependencies you add do not conflict with Spark's dependencies. Speaking
from experience, there are a couple of ways to achieve this:

1. Where possible, use artifacts that ship their dependencies shaded and
relocated.
2. If you're creating packages of your own, ensure your dependencies (and
their transitive dependencies) are compatible with the versions that Spark
uses. If they aren't, shade and relocate the conflicting dependencies in
your own artifact (see the shade plugin sketch after this list).
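
For the second point, here is a minimal maven-shade-plugin sketch that
relocates a conflicting dependency inside your own jar. The Guava package
and the relocation prefix are purely illustrative; relocate whatever
actually clashes in your build:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.6.0</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <relocations>
                    <relocation>
                        <!-- illustrative: move Guava classes so they
                             cannot clash with Spark's copy -->
                        <pattern>com.google.common</pattern>
                        <shadedPattern>com.example.shaded.guava</shadedPattern>
                    </relocation>
                </relocations>
            </configuration>
        </execution>
    </executions>
</plugin>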

Both Maven and Gradle can help you here, and both can give you a
dependency report so you can see exactly which versions end up on the
classpath. Maven additionally has the enforcer plugin with the dependency
convergence rule, which fails the build when different versions of the
same artifact are pulled in. Using it for this does require you to declare
the Spark dependencies in the pom (as *provided* scope), and the
convergence rule ignores the *provided* scope by default, so you have to
opt it in. Once the *provided* scope is checked, you'll start hitting a
lot of convergence failures among the transitive dependencies that Spark
pulls in, and you'll have to work through each of them to get the build to
pass.
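
The report itself is one command per build tool (the Gradle invocation
assumes the usual wrapper setup):

    mvn dependency:tree
    ./gradlew dependencies --configuration runtimeClasspath

Here's a trimmed pom.xml showing the whole setup: Spark declared as
*provided*, the enforcer rule opted in for that scope, and a
copy-dependencies execution that collects the runtime jars for the image: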

<properties>
    <spark.version>3.4.3</spark.version>
    <scala.compat.version>2.12</scala.compat.version>
    <scala.minor.version>11</scala.minor.version>
    <scala.version>${scala.compat.version}.${scala.minor.version}</scala.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.36</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.compat.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-enforcer-plugin</artifactId>
            <version>3.5.0</version>
            <executions>
                <execution>
                    <id>enforce</id>
                    <configuration>
                        <rules>
                            <dependencyConvergence>
                                <excludedScopes>
                                    <!-- the rule skips both provided and
                                         test scopes by default; listing
                                         only test here makes it also check
                                         the provided (Spark) dependencies -->
                                    <scope>test</scope>
                                </excludedScopes>
                            </dependencyConvergence>
                        </rules>
                    </configuration>
                    <goals>
                        <goal>enforce</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <version>3.8.0</version>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>${project.build.directory}/spark-jars</outputDirectory>
                        <overWriteReleases>false</overWriteReleases>
                        <overWriteSnapshots>false</overWriteSnapshots>
                        <overWriteIfNewer>true</overWriteIfNewer>
                        <includeScope>runtime</includeScope>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
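
When the enforcer does flag a convergence failure, you generally either
exclude the offending transitive dependency or pin one version for the
whole build. A hypothetical sketch of the pinning approach (the Guava
coordinates and version are illustrative, not what Spark actually ships):

<dependencyManagement>
    <dependencies>
        <dependency>
            <!-- illustrative pin: force a single Guava version everywhere -->
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>33.0.0-jre</version>
        </dependency>
    </dependencies>
</dependencyManagement>

The copy-dependencies execution above leaves the runtime jars in
target/spark-jars, so the Dockerfile side mirrors the Gradle example
further down in this thread (the base tag is the one from that example;
adjust the paths to your own layout):

FROM spark:3.5.3-scala2.12-java17-ubuntu

USER root

COPY --chown=spark:spark target/spark-jars/* "$SPARK_HOME/jars/"

USER spark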


On Thu, 17 Oct 2024, 13:51 Nimrod Ofek <ofek.nim...@gmail.com> wrote:

>
> Hi,
>
> Thanks all for the replies.
>
> I am adding the Spark dev list as well - as I think this might be an issue
> that needs to be addressed.
>
> The options presented here will get the jars - but they don't help us with
> dependencies conflicts...
> For example - com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0 -
> uses Guava 30 while Spark 3.5.3 uses Guava 14 - the options here will
> result with both conflicting.
>
> How can one add packages to their Spark (during the build process of the
> Docker image) - without causing unresolved conflicts?
>
> Thanks!
> Nimrod
>
>
> On Tue, Oct 15, 2024 at 6:53 PM Damien Hawes <marley.ha...@gmail.com>
> wrote:
>
>> Herewith a more fleshed out example:
>>
>> An example of a *build.gradle.kts* file:
>>
>> plugins {
>>     id("java")
>> }
>>
>> val sparkJarsDir = 
>> objects.directoryProperty().convention(layout.buildDirectory.dir("sparkJars"))
>>
>> repositories {
>>     mavenCentral()
>> }
>>
>> val sparkJars: Configuration by configurations.creating {
>>     isCanBeResolved = true
>>     isCanBeConsumed = false
>> }
>>
>> dependencies {
>>     sparkJars("com.fasterxml.jackson.core:jackson-databind:2.18.0")
>> }
>>
>> val copySparkJars by tasks.registering(Copy::class) {
>>     group = "build"
>>     description = "Copies the appropriate jars to the configured spark jars directory"
>>     from(sparkJars)
>>     into(sparkJarsDir)
>> }
>>
>> Now, the *Dockerfile*:
>>
>> FROM spark:3.5.3-scala2.12-java17-ubuntu
>>
>> USER root
>>
>> COPY --chown=spark:spark build/sparkJars/* "$SPARK_HOME/jars/"
>>
>> USER spark
>>
>>
>> Kind regards,
>>
>> Damien
>>
>> On Tue, Oct 15, 2024 at 4:19 PM Damien Hawes <marley.ha...@gmail.com>
>> wrote:
>>
>>> The simplest solution that I have found in solving this was to use
>>> Gradle (or Maven, if you prefer), and list the dependencies that I want
>>> copied to $SPARK_HOME/jars as project dependencies.
>>>
>>> Summary of steps to follow:
>>>
>>> 1. Using your favourite build tool, declare a dependency on your
>>> required packages.
>>> 2. Write your Dockerfile, with or without the Spark binaries inside it.
>>> 3. Using your build tool to copy the dependencies to a location that the
>>> Docker daemon can access.
>>> 4. Copy the dependencies into the correct directory.
>>> 5. Ensure those files have the correct permissions.
>>>
>>> In my opinion, it is pretty easy to do this with Gradle.
>>>
>>> On Tue, 15 Oct 2024, 15:28 Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am creating a base Spark image that we are using internally.
>>>> We need to add some packages to the base image:
>>>> spark:3.5.1-scala2.12-java17-python3-r-ubuntu
>>>>
>>>> Of course I do not want to Start Spark with --packages "..." - as it is
>>>> not efficient at all - I would like to add the needed jars to the image.
>>>>
>>>> Ideally, I would have add to my image something that will add the
>>>> needed packages - something like:
>>>>
>>>> RUN $SPARK_HOME/bin/add-packages "..."
>>>>
>>>> But AFAIK there is no such option.
>>>>
>>>> Other than running Spark to add those packages and then creating the
>>>> image - or running Spark always with --packages "..."  - what can I do?
>>>> Is there a way to run just the code that is run by the --package
>>>> command - without running Spark, so I can add the needed dependencies to my
>>>> image?
>>>>
>>>> I am sure this is something that I am not the only one nor the first
>>>> one to encounter...
>>>>
>>>> Thanks!
>>>> Nimrod
>>>>
>>>>
>>>>
>>>
