dongjoon-hyun commented on code in PR #45237:
URL: https://github.com/apache/spark/pull/45237#discussion_r1501334313
##########
dev/make-distribution.sh:
##########
@@ -189,6 +189,12 @@ echo "Build flags: $@" >> "$DISTDIR/RELEASE"
# Copy jars
cp "$SPARK_HOME"/assembly/target/scala*/jars/* "$DISTDIR/jars/"
+# Only create the hive-jackson directory if they exist.
+for f in "$DISTDIR"/jars/jackson-*-asl-*.jar; do
+ mkdir -p "$DISTDIR"/hive-jackson
+ mv $f "$DISTDIR"/hive-jackson/
+done
Review Comment:
There are 5 main benefits, similar to the `yarn` directory, @viirya.
1. The following Apache Spark daemons (which are launched via
`sbin/spark-daemon.sh start`) will ignore the `hive-jackson` directory.
- Spark Master
- Spark Worker
- Spark History Server
```
$ grep 'spark-daemon.sh start' *
start-history-server.sh:exec "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 "$@"
start-master.sh:"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
start-worker.sh: "${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS $WORKER_NUM \
```
2. **Recoverability**: Users of Spark 3 as it is today can achieve the same
goal by manually deleting those two files from Spark's `jars` directory.
However, it is difficult to recover the deleted files when a production job
fails because of a Hive UDF. This PR provides a more robust and safer way via
a configuration.
3. **Communication**: We (and the security team) can easily communicate that
`hive-jackson` is not used, just like the `yarn` directory, because it is
physically split from the rest of the distribution. Also, they can delete the
directory easily (if they need to) without knowing the details of the
dependency list inside that directory.
4. **Robustness**: If Apache Spark has everything in `jars`, it is difficult
to prevent those jars from being loaded. Of course, we could choose a tricky
approach and filter them out of the class file list by naming pattern, but
that is still less robust in the long term.
5. **Compatibility with `hive-jackson-provided`**: With the existing
`hive-jackson-provided`, this PR provides a cleaner injection point for the
provided dependencies. For example, custom-built Jackson dependencies can be
placed in `hive-jackson` (after users create it) instead of `jars`. We are
very reluctant to have anyone put custom jar files directly into Apache
Spark's `jars` directory. The `hive-jackson` directory is a more advisable
place for them than Spark's `jars` directory.
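
As a rough illustration of the physical split, and of why it is easier to
reason about than name-based filtering, here is a minimal sketch. It is not the
code in this PR: `split_hive_jackson`, `build_classpath`, and the `include`
argument are hypothetical names, and the real launcher logic may differ. The
`[ -e "$f" ]` check covers the case where the glob matches nothing, so the
directory is created only when such jars actually exist.

```shell
#!/usr/bin/env bash
# Sketch only. The directory layout mirrors the diff above; the function names
# and the "include" argument are hypothetical, not actual Spark settings.

# Move the Jackson 1.x (asl) jars out of jars/ into hive-jackson/, creating
# the directory only when at least one such jar exists.
split_hive_jackson() {
  local distdir="$1" f
  for f in "$distdir"/jars/jackson-*-asl-*.jar; do
    # An unmatched glob stays as the literal pattern, so skip non-files.
    [ -e "$f" ] || continue
    mkdir -p "$distdir/hive-jackson"
    mv "$f" "$distdir/hive-jackson/"
  done
}

# Build a classpath that always contains jars/* and adds hive-jackson/* only
# when the directory exists and the caller opts in.
build_classpath() {
  local distdir="$1" include="${2:-false}"
  local cp="$distdir/jars/*"
  if [ "$include" = "true" ] && [ -d "$distdir/hive-jackson" ]; then
    cp="$cp:$distdir/hive-jackson/*"
  fi
  printf '%s\n' "$cp"
}
```

With this layout, excluding the jars is a matter of not adding one classpath
entry (or deleting one directory), rather than matching file names inside
`jars`.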
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]