Re: [PR] [SPARK-47152][SQL][BUILD] Provide `CodeHaus Jackson` dependencies via a new optional directory [spark]

via GitHub Fri, 23 Feb 2024 21:53:28 -0800


dongjoon-hyun commented on code in PR #45237:
URL: https://github.com/apache/spark/pull/45237#discussion_r1501334313



##########
dev/make-distribution.sh:
##########
@@ -189,6 +189,12 @@ echo "Build flags: $@" >> "$DISTDIR/RELEASE"
 # Copy jars
 cp "$SPARK_HOME"/assembly/target/scala*/jars/* "$DISTDIR/jars/"
 
+# Only create the hive-jackson directory if they exist.
+for f in "$DISTDIR"/jars/jackson-*-asl-*.jar; do
+  mkdir -p "$DISTDIR"/hive-jackson
+  mv $f "$DISTDIR"/hive-jackson/
+done

Review Comment:
   There are 5 main benefits, similar to the `yarn` directory, @viirya.
   
   1. The following Apache Spark daemons (which use `sbin/spark-daemon.sh 
start`) will ignore the `hive-jackson` directory.
       - Spark Master
       - Spark Worker
       - Spark History Server
   ```
   $ grep 'spark-daemon.sh start' *
   start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
   start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
   start-worker.sh:  "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
   ```
   
   2. **Recoverability**: Existing Spark 3 users can achieve the same goal by 
manually deleting those two files from Spark's `jars` directory. However, 
it's difficult to recover the deleted files when a production job fails due to 
a Hive UDF. This PR provides a more robust and safer way with a configuration.
   
   3. **Communication**: We (and the security team) can easily communicate that 
`hive-jackson` is not used, just like the `yarn` directory, because it's 
physically split from the distribution. Also, they can delete the directory 
easily (if they need to) without knowing the details of the dependency list 
inside that directory.
   
   4. **Robustness**: If Apache Spark has everything in `jars`, it's difficult 
to prevent those jars from being loaded. Of course, we may choose a tricky way 
to filter them out of the class file list via a naming pattern, but that is 
still less robust in the long term.
   
   5. **Compatibility with `hive-jackson-provided`**: With the existing 
`hive-jackson-provided`, this PR provides a cleaner injection point for the 
provided dependencies. For example, custom-built Jackson dependencies can 
be placed in `hive-jackson` (after users create it) instead of `jars`. We are 
very reluctant to have someone put their custom jar files into Apache Spark's 
`jars` directory directly. The `hive-jackson` directory could be a more 
recommendable way than copying into Spark's `jars` directory.
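
As a hedged illustration of the split that the hunk above performs, the following sketch moves the `jackson-*-asl-*.jar` files out of `jars` into a sibling `hive-jackson` directory. The function name `split_hive_jackson` is hypothetical and not part of the PR. It also assumes one addition that the loop in the hunk does not have: an existence check, because in a shell without `nullglob` an unmatched glob is passed through as a literal pattern.

```shell
# Hypothetical helper (not part of the PR): move Codehaus Jackson jars from
# "$1/jars" into "$1/hive-jackson", creating the directory only on a match.
split_hive_jackson() {
  distdir="$1"
  for f in "$distdir"/jars/jackson-*-asl-*.jar; do
    # An unmatched glob stays a literal pattern, so skip paths that do not exist.
    [ -e "$f" ] || continue
    mkdir -p "$distdir/hive-jackson"
    # Quote the path so that directories containing spaces are handled.
    mv "$f" "$distdir/hive-jackson/"
  done
}
```

Called as `split_hive_jackson "$DISTDIR"`, this would leave every other jar in `jars` untouched and would not create an empty `hive-jackson` directory when no matching jars were built.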



