sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-929571635
Thanks @JoshRosen! This is some great analysis!
> I think we'll also run into similar problems in the Maven build. According to Maven's build lifecycle docs:
I completely missed this 🤦. Yes, adding a `hive-shaded` module inside Spark is not a good idea given the SBT and Maven test-lifecycle issues above, and now I understand why other projects put the shaded library in a separate repo :)
Let me spend more time revisiting the following two paths:
1. Shade all the dependencies in Hive (e.g., via the `hive-exec` fat jar), make a new Hive release, and have Spark start using that.
2. Create an ASF repo such as `spark-thirdparty`, following the example of HBase & Hadoop. As you mentioned, this needs community discussion, and I'm not sure how much extra burden it would add to Spark's maintenance process. (A rough sketch of what the relocation config could look like is below.)
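For reference, here is a minimal sketch of what the relocation could look like in a hypothetical `spark-thirdparty`-style module, using the standard `maven-shade-plugin`. The module layout, relocation pattern, and shaded package prefix are all illustrative, not a concrete proposal:

```xml
<!-- Hypothetical pom.xml fragment for a spark-thirdparty-style shaded module.   -->
<!-- The relocated package prefix below is illustrative only.                    -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <relocation>
                <!-- Relocate Guava so it cannot clash with Hive's or Hadoop's copy. -->
                <pattern>com.google.common</pattern>
                <shadedPattern>org.sparkproject.thirdparty.com.google.common</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Publishing such a pre-shaded artifact from its own repo would sidestep the SBT/Maven test-lifecycle problem, since downstream Spark modules would just consume an ordinary released jar.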
> There's a tricky corner-case if a user has manually built a metastore classpath which includes only the dependencies not already provided by Spark
Thanks for the detailed explanation of how `IsolatedClientLoader` works, and I agree this is a minor issue we should be aware of. We can either mention it in the release notes, or perhaps exclude the unshaded Guava jar completely from the Spark distribution (for `hadoop-3.2`). Currently this appears to be blocked by the `curator-client` dependency, as discussed earlier in the PR, but perhaps there is still a way to ship only the shaded Guava (from `network-common`), with the few classes required by `curator-client` excluded from relocation.
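If we go down that route, the `maven-shade-plugin` relocation element does support per-class excludes, so something roughly like the sketch below might work. The excluded class name is purely a placeholder, since I haven't checked exactly which Guava classes `curator-client` reaches into:

```xml
<!-- Sketch only: keep a handful of Guava classes un-relocated so that            -->
<!-- curator-client can still resolve them; the exclude pattern is a placeholder. -->
<relocation>
  <pattern>com.google.common</pattern>
  <shadedPattern>org.sparkproject.guava</shadedPattern>
  <excludes>
    <!-- Hypothetical: whichever classes curator-client needs from unshaded Guava. -->
    <exclude>com.google.common.base.SomeClassNeededByCurator</exclude>
  </excludes>
</relocation>
```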
> One more consideration: what about Hadoop 2.7 builds?
Another good question :) You are right that Hadoop 2.7 still uses unshaded Guava, while Hadoop 3.3.1 has switched to shaded Guava via HADOOP-17288. In addition, Spark uses the shaded Hadoop client from HADOOP-11804, which further relocates other Hadoop dependencies so they won't pollute Spark's classpath.
I think one approach is to keep Guava 14.0.1 for the `hadoop-2.7` profile so everything stays the same there. This will at least unblock us from upgrading Guava for the default `hadoop-3.2` profile and ensure that all published Spark artifacts get the newer version of Guava. The aforementioned idea of excluding unshaded Guava from the Spark distribution would also only apply to the latter.
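Concretely, keeping the old Guava for `hadoop-2.7` could be as simple as overriding the version property per profile, roughly along these lines (the property name mirrors what the root pom already uses; the newer Guava version shown is just an illustration, not the version this PR targets):

```xml
<!-- Sketch: keep Guava 14.0.1 as the default (used by hadoop-2.7) and bump it   -->
<!-- only under the hadoop-3.2 profile; the newer version here is illustrative.  -->
<properties>
  <guava.version>14.0.1</guava.version>
</properties>

<profiles>
  <profile>
    <id>hadoop-3.2</id>
    <properties>
      <!-- Hypothetical newer Guava for the default Hadoop 3 build. -->
      <guava.version>30.1.1-jre</guava.version>
    </properties>
  </profile>
</profiles>
```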
A crazier idea is to also shade Hadoop 2.7 if we go with the `spark-thirdparty` approach, but I'm not sure it's worth it given that we will eventually deprecate `hadoop-2.7`.