sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-929571635


   Thanks @JoshRosen! This is some great analysis!
   
   > I think we'll also run into similar problems in the Maven build. According 
to Maven's build lifecycle docs:
   
   I completely missed this 🤦. Yes, adding the `hive-shaded` module in Spark would not be a good idea given the above reasons about the SBT and Maven test lifecycles, and now I understand why other projects put the shaded library in a separate repo :)
   
   Let me spend more time revisiting the following two paths:
   1. shade all the dependencies in Hive (e.g., via the `hive-exec` fat jar) and make a new release, so Spark can start using that (a rough relocation sketch follows this list).
   2. create an ASF repo such as `spark-thirdparty`, following the examples from HBase & Hadoop. This needs community discussion as you mentioned, and I'm not sure how much extra burden it would add to Spark's maintenance process.
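   
   To make path 1 more concrete, here is a minimal sketch of a `maven-shade-plugin` relocation for a `hive-exec` fat jar. This is only an illustration: the shaded prefix `org.apache.hive.shaded` and the single Guava relocation are assumptions, not Hive's actual build configuration.
   
   ```xml
   <!-- Illustrative sketch: relocating Guava inside a hive-exec fat jar.
        The shaded prefix below is a made-up example, not Hive's real one. -->
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <executions>
       <execution>
         <phase>package</phase>
         <goals>
           <goal>shade</goal>
         </goals>
         <configuration>
           <relocations>
             <relocation>
               <pattern>com.google.common</pattern>
               <shadedPattern>org.apache.hive.shaded.com.google.common</shadedPattern>
             </relocation>
           </relocations>
         </configuration>
       </execution>
     </executions>
   </plugin>
   ```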
   
   > There's a tricky corner-case if a user has manually built a metastore 
classpath which includes only the dependencies not already provided by Spark
   
   Thanks for the detailed explanation of how `IsolatedClientLoader` works, and I agree this is a minor issue we should be aware of. We can either put a note in the release notes, or perhaps exclude the unshaded Guava jar completely from the Spark distribution (for `hadoop-3.2`). Currently this appears to be blocked by the `curator-client` dependency as discussed earlier in the PR, but perhaps there is still a way to ship only the shaded Guava (from `network-common`) with the few classes required by `curator-client` excluded from relocation.
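   
   As a rough sketch of that last idea (the excluded class below is a hypothetical placeholder, since the real set of classes `curator-client` needs would have to be verified, and the shaded prefix simply follows `network-common`'s existing convention):
   
   ```xml
   <!-- Sketch: relocate Guava in network-common but keep the handful of classes
        curator-client references un-relocated. The exclude entry is a
        hypothetical placeholder, not a verified class name. -->
   <relocation>
     <pattern>com.google.common</pattern>
     <shadedPattern>org.sparkproject.guava</shadedPattern>
     <excludes>
       <exclude>com.google.common.base.SomeClassNeededByCurator</exclude>
     </excludes>
   </relocation>
   ```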
   
   > One more consideration: what about Hadoop 2.7 builds?
   
   Another good question :) You are right that Hadoop 2.7 still uses unshaded Guava, while Hadoop 3.3.1 has switched to shaded Guava via HADOOP-17288. In addition, Spark uses the shaded Hadoop client from HADOOP-11804, which further relocates other Hadoop dependencies so they won't pollute Spark's classpath.
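   
   For reference, a sketch of what that looks like on the dependency side (versions are placeholders): HADOOP-17288 puts Guava under a relocated package such as `org.apache.hadoop.thirdparty.com.google.common`, and HADOOP-11804 exposes the client through shaded artifacts along these lines:
   
   ```xml
   <!-- Sketch: Hadoop 3.x shaded client artifacts; hadoop-client-runtime carries
        Hadoop's third-party dependencies under relocated package names so they
        don't leak onto Spark's classpath. The version is a placeholder. -->
   <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client-api</artifactId>
     <version>${hadoop.version}</version>
   </dependency>
   <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client-runtime</artifactId>
     <version>${hadoop.version}</version>
     <scope>runtime</scope>
   </dependency>
   ```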
   
   I think one approach is to keep Guava 14.0.1 for the `hadoop-2.7` profile so everything stays the same there. This at least unblocks us from upgrading Guava for the default `hadoop-3.2` profile, and makes sure all the published Spark artifacts get the newer version of Guava. Also, the aforementioned idea of excluding unshaded Guava from the Spark distribution would only apply to the latter.
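   
   A rough sketch of what that split could look like in the parent POM (the profile and property names mirror Spark's existing conventions, but the exact values are illustrative and the newer Guava version is left as a placeholder):
   
   ```xml
   <!-- Sketch: keep the old Guava for hadoop-2.7 while the default profile moves
        ahead. NEWER_GUAVA_VERSION is a placeholder for the version this PR targets. -->
   <properties>
     <guava.version>NEWER_GUAVA_VERSION</guava.version>
   </properties>
   
   <profiles>
     <profile>
       <id>hadoop-2.7</id>
       <properties>
         <!-- hadoop-2.7 stays on the old Guava so nothing changes there -->
         <guava.version>14.0.1</guava.version>
       </properties>
     </profile>
   </profiles>
   ```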
   
   A crazier idea is to also shade Hadoop 2.7 if we go with the `spark-thirdparty` approach, but I'm not sure it's worth it given that we are going to deprecate `hadoop-2.7` eventually.
   
    

