mosche commented on issue #23568: URL: https://github.com/apache/beam/issues/23568#issuecomment-1286746306
Unfortunately classpath issues are a common trouble both with Beam and Spark. From Beam's perspective this is a `NO FIX`, downgrading Jackson to the old version used by Spark 2.4 is obviously not an option. Spark offers a workaround to handle this using [`userClassPathFirst`](https://spark.apache.org/docs/latest/configuration.html): Property Name | Default | Meaning | Since Version -- | -- | -- | -- | spark.driver.userClassPathFirst | false | (Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only. | 1.3.0 | | spark.executor.userClassPathFirst | false | (Experimental) Same functionality as spark.driver.userClassPathFirst, but applied to executor instances. | 1.3.0 When enabling `userClassPathFirst` it's critical to remove some dependencies from the application uber jar. Specifically these are spark, hadoop, scala, slf4j, log4j dependencies. Otherwise related classes would be loaded by a separate classloader on the user classpath, making them incompatible with the ones loaded by Spark's system classpath loader (even if versions match exactly). E.g., if using `maven-shade-plugin`, this can be done by adding the following `configuration` to exclude respective artifacts (by groupId). ```xml <artifactSet> <excludes> <exclude>log4j</exclude> <exclude>org.slf4j</exclude> <exclude>io.dropwizard.metrics</exclude> <exclude>org.scala-lang</exclude> <exclude>org.scala-lang.modules</exclude> <exclude>org.apache.spark</exclude> <exclude>org.apache.hadoop</exclude> </excludes> </artifactSet> ``` Additionally you might have to explicitly bump the version of some Jackson modules to match the version used by Beam. ```xml <dependency> <groupId>com.fasterxml.jackson.module</groupId> <artifactId>jackson-module-scala_2.11</artifactId> <version>${jackson.version}</version> <scope>runtime</scope> </dependency> <dependency> <groupId>com.fasterxml.jackson.module</groupId> <artifactId>jackson-module-jaxb-annotations</artifactId> <version>${jackson.version}</version> <scope>runtime</scope> </dependency> <dependency> <groupId>com.fasterxml.jackson.module</groupId> <artifactId>jackson-module-paranamer</artifactId> <version>${jackson.version}</version> <scope>runtime</scope> </dependency> ``` Step-by-step example 1. Generate the [WordCount quickstart example](https://beam.apache.org/get-started/quickstart-java/) 2. Modify the `maven-shade-plugin` configuration as mentioned above. 3. Add the Jackson dependencies mentioned above to the `spark-runner` profile. 4. Build the uber jar using the `spark-runner` profile ``` mvn package -Pspark-runner ``` 5. Copy it to your Spark cluster (or shared storage) 6. Run spark-submit on the cluster ``` spark-submit --class org.apache.beam.examples.WordCount \ --master spark://<SPARK_MASTER>:7077 \ --deploy-mode cluster \ --conf spark.driver.userClassPathFirst=true \ --conf spark.executor.userClassPathFirst=true \ <STORAGE_PATH>/word-count-beam-bundled-0.1.jar \ --runner=SparkRunner \ --output=<RESULT_PATH>/counts ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
