zsxwing opened a new pull request #29059:
URL: https://github.com/apache/spark/pull/29059


   ### What changes were proposed in this pull request?
   
   Force the initialization of Hadoop `VersionInfo` in `HiveExternalCatalog` to
make sure Hive can get the Hadoop version when using the isolated classloader.
   
   ### Why are the changes needed?
   
   This is a regression in Spark 3.0.0 because we switched the default Hive 
execution version from 1.2.1 to 2.3.7.
   
   Spark allows the user to set `spark.sql.hive.metastore.jars` to specify jars
used to access the Hive Metastore. These jars are loaded by the isolated
classloader. Because we also share Hadoop classes with the isolated classloader,
the user doesn't need to add Hadoop jars to `spark.sql.hive.metastore.jars`,
which means the hadoop-common jar is not available to the isolated classloader.
If Hadoop `VersionInfo` is not initialized before we switch to the isolated
classloader, and we then try to initialize it using the isolated classloader
(the current thread context classloader), the initialization fails and reports
the version as `Unknown`, which causes Hive to throw the following exception:
   
   ```
   08:49:33.242 ERROR org.apache.hadoop.hive.shims.ShimLoader: Error loading 
shims
   java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* 
format)
        at 
org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:147)
        at 
org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122)
        at 
org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88)
        at 
org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:377)
        at 
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:268)
        at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
        at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
        at 
org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
        at 
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
        at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:517)
        at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:482)
        at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:544)
        at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:370)
        at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
        at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
        at 
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
        at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:219)
        at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:67)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1548)
        at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
        at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
        at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
        at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3080)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3108)
        at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:543)
        at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:511)
        at 
org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:175)
        at 
org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:128)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:301)
        at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:431)
        at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:324)
        at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:72)
        at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:71)
   ```
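   The failure mode can be illustrated without Hadoop on the classpath. The sketch below is an assumption modeled on how hadoop-common's `VersionInfo` resolves the version (the resource name and fallback string are illustrative, not the actual Hadoop code): the version is read from a properties resource through a classloader, and when the loader cannot see that resource, the lookup falls back to `Unknown`, the exact string Hive's `ShimLoader` rejects.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Properties;

public class VersionLookupSketch {
    // Simplified illustration (assumption modeled on hadoop-common's
    // VersionInfo, not the real implementation): read the version from a
    // *-version-info.properties resource via the given classloader, falling
    // back to "Unknown" when the resource is not visible to that loader.
    static String resolveVersion(ClassLoader cl) throws IOException {
        Properties props = new Properties();
        try (InputStream in = cl.getResourceAsStream("common-version-info.properties")) {
            if (in != null) {
                props.load(in);
            }
        }
        return props.getProperty("version", "Unknown");
    }

    public static void main(String[] args) throws IOException {
        // An isolated classloader without hadoop-common behaves like this
        // empty loader: the properties resource is missing, so the resolved
        // version is "Unknown".
        ClassLoader empty = new URLClassLoader(new URL[0], null);
        System.out.println(resolveVersion(empty));
    }
}
```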
   
   Technically, this is an issue in Hadoop `VersionInfo` itself, which has been
fixed upstream: https://issues.apache.org/jira/browse/HADOOP-14067. But since we
still support old Hadoop versions, we should fix it on the Spark side as well.
   
   Why does this issue start to happen in Spark 3.0.0?
   
   In Spark 2.4.x, we use Hive 1.2.1 by default, which triggers `VersionInfo`
initialization in the static initializer of the `Hive` class. This happens when
we load the `HiveClientImpl` class, because the `HiveClientImpl.client` method
refers to the `Hive` class. At that moment, the thread context classloader is
not yet the isolated classloader, so it can access the hadoop-common jar on the
classpath and initialize `VersionInfo` correctly.
   
   In Spark 3.0.0, we use Hive 2.3.7. The static initializer of the `Hive`
class no longer accesses `VersionInfo` because of the change in
https://issues.apache.org/jira/browse/HIVE-11657. Instead, `VersionInfo` is
accessed when creating a `Hive` object (see the stack trace above). This happens
at
https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L260,
but we switch to the isolated classloader before calling `HiveClientImpl.client`
(see
https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L283).
This is exactly the scenario described above: if Hadoop `VersionInfo` is not
initialized before we switch to the isolated classloader, and we try to
initialize it using the isolated classloader (the current thread context
classloader), the initialization fails.
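   The pattern the fix relies on can be sketched in a few lines. This is an illustrative sketch, not the actual patch (the real change lives in `HiveExternalCatalog`): any classloader-sensitive static initialization is forced while the original thread context classloader is still active, and only then does the thread switch to the isolated loader, restoring the original one afterwards.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.function.Supplier;

public class IsolatedLoaderSwitch {
    // Hypothetical helper (illustrative name): run `body` with `isolated` as
    // the thread context classloader, restoring the original loader afterwards.
    static <T> T withContextClassLoader(ClassLoader isolated, Supplier<T> body) {
        // In the real fix, this is the point where Hadoop's VersionInfo would
        // be touched eagerly (e.g. VersionInfo.getVersion()), before the
        // context classloader stops being able to see hadoop-common.
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        current.setContextClassLoader(isolated);
        try {
            return body.get();
        } finally {
            current.setContextClassLoader(original);
        }
    }

    public static void main(String[] args) {
        ClassLoader isolated = new URLClassLoader(new URL[0], null);
        // Inside the body, the isolated loader is the context classloader;
        // after the call, the original loader is restored.
        ClassLoader seen = withContextClassLoader(isolated,
            () -> Thread.currentThread().getContextClassLoader());
        System.out.println(seen == isolated);
        System.out.println(Thread.currentThread().getContextClassLoader() != isolated);
    }
}
```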
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   The new regression test added in this PR.
   

