mosche commented on issue #23568:
URL: https://github.com/apache/beam/issues/23568#issuecomment-1286746306

   Unfortunately, classpath issues are a common source of trouble with both Beam and Spark.
   From Beam's perspective this is a `NO FIX`: downgrading Jackson to the old version used by Spark 2.4 is obviously not an option.
   
   Spark offers a workaround to handle this using 
[`userClassPathFirst`](https://spark.apache.org/docs/latest/configuration.html):
   
   Property Name | Default | Meaning | Since Version
   -- | -- | -- | --
   `spark.driver.userClassPathFirst` | false | (Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only. | 1.3.0
   `spark.executor.userClassPathFirst` | false | (Experimental) Same functionality as `spark.driver.userClassPathFirst`, but applied to executor instances. | 1.3.0
   
   When enabling `userClassPathFirst` it's critical to remove some dependencies from the application uber jar, specifically the Spark, Hadoop, Scala, SLF4J and Log4j dependencies. Otherwise those classes would be loaded by a separate classloader on the user classpath, making them incompatible with the ones loaded by Spark's system classpath loader (even if the versions match exactly).
   
   E.g., if using the `maven-shade-plugin`, this can be done by adding the following to its `configuration` to exclude the respective artifacts (by groupId):
   
   ```xml
   <artifactSet>
     <excludes>
       <exclude>log4j</exclude>
       <exclude>org.slf4j</exclude>
       <exclude>io.dropwizard.metrics</exclude>
       <exclude>org.scala-lang</exclude>
       <exclude>org.scala-lang.modules</exclude>
       <exclude>org.apache.spark</exclude>
       <exclude>org.apache.hadoop</exclude>
     </excludes>
   </artifactSet>
   ```
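
   Before submitting, it can help to double-check that the exclusions actually took effect by listing the shaded jar's contents (assuming the default jar name from the quickstart):

   ```
   # Should print nothing if Spark, Hadoop and Scala classes were excluded from the uber jar
   jar tf target/word-count-beam-bundled-0.1.jar | grep -E '^(org/apache/spark|org/apache/hadoop|scala)/'
   ```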
   
   Additionally, you might have to explicitly bump the version of some Jackson modules to match the version used by Beam:
   ```xml
   <dependency>
     <groupId>com.fasterxml.jackson.module</groupId>
     <artifactId>jackson-module-scala_2.11</artifactId>
     <version>${jackson.version}</version>
     <scope>runtime</scope>
   </dependency>
   <dependency>
     <groupId>com.fasterxml.jackson.module</groupId>
     <artifactId>jackson-module-jaxb-annotations</artifactId>
     <version>${jackson.version}</version>
     <scope>runtime</scope>
   </dependency>
   <dependency>
     <groupId>com.fasterxml.jackson.module</groupId>
     <artifactId>jackson-module-paranamer</artifactId>
     <version>${jackson.version}</version>
     <scope>runtime</scope>
   </dependency>
   ```
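
   If `${jackson.version}` isn't already defined in your POM, add it to `<properties>`; it should match the Jackson version your Beam release pulls in (the value below is just a placeholder, check e.g. `mvn dependency:tree -Dincludes=com.fasterxml.jackson.core` for the actual one).

   ```xml
   <properties>
     <!-- Placeholder: set this to the Jackson version shipped with your Beam release -->
     <jackson.version>2.x.y</jackson.version>
   </properties>
   ```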
   
   Step-by-step example
   1. Generate the [WordCount quickstart 
example](https://beam.apache.org/get-started/quickstart-java/)
   2. Modify the `maven-shade-plugin` configuration as mentioned above.
   3. Add the Jackson dependencies mentioned above to the `spark-runner` 
profile.
   4. Build the uber jar using the `spark-runner` profile
       ```
       mvn package -Pspark-runner
       ```
   5. Copy it to your Spark cluster (or shared storage)
   6. Run spark-submit on the cluster
       ```
       spark-submit --class org.apache.beam.examples.WordCount \
         --master spark://<SPARK_MASTER>:7077 \
         --deploy-mode cluster \
         --conf spark.driver.userClassPathFirst=true \
         --conf spark.executor.userClassPathFirst=true \
         <STORAGE_PATH>/word-count-beam-bundled-0.1.jar \
         --runner=SparkRunner \
          --output=<RESULT_PATH>/counts
        ```
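
   Once the job finishes, the counts end up in sharded text files under the output path (assuming `<RESULT_PATH>` points at storage you can reach from where you're checking):
   ```
   ls <RESULT_PATH>/
   # e.g. counts-00000-of-00003, counts-00001-of-00003, ...
   head <RESULT_PATH>/counts-00000-of-*
   ```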

