zsxwing opened a new pull request #29059:
URL: https://github.com/apache/spark/pull/29059
### What changes were proposed in this pull request?
Force the initialization of Hadoop `VersionInfo` in `HiveExternalCatalog` to make sure
Hive can get the Hadoop version when using the isolated classloader.
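A minimal sketch of the idea (`HadoopVersionInfoInit` is an illustrative stand-in; in the actual patch the eager reference lives in `HiveExternalCatalog`):
```scala
import org.apache.hadoop.util.VersionInfo

object HadoopVersionInfoInit {
  // Touching VersionInfo here, while the application classloader is still the
  // thread context classloader, runs its static initializer against
  // hadoop-common on the main classpath. Later callers, such as Hive's
  // ShimLoader running under the isolated classloader, then see a valid
  // version string instead of "Unknown".
  val hadoopVersion: String = VersionInfo.getVersion

  def main(args: Array[String]): Unit =
    println(s"Eagerly initialized Hadoop version: $hadoopVersion")
}
```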
### Why are the changes needed?
This is a regression in Spark 3.0.0 because we switched the default Hive
execution version from 1.2.1 to 2.3.7.
Spark allows the user to set `spark.sql.hive.metastore.jars` to specify the jars
used to access the Hive Metastore. These jars are loaded by the isolated
classloader. Because we also share Hadoop classes with the isolated classloader,
the user doesn't need to add Hadoop jars to `spark.sql.hive.metastore.jars`,
which means the hadoop-common jar is typically not available to the isolated
classloader. If Hadoop `VersionInfo` is not initialized before we switch to the
isolated classloader, and we try to initialize it using the isolated classloader
(the current thread context classloader), the initialization fails and reports
the version as `Unknown`, which causes Hive to throw the following exception:
```
08:49:33.242 ERROR org.apache.hadoop.hive.shims.ShimLoader: Error loading shims
java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format)
    at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:147)
    at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122)
    at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88)
    at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:377)
    at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:268)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
    at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
    at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:517)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:482)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:544)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:370)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:219)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:67)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1548)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3080)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3108)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:543)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:511)
    at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:175)
    at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:128)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:301)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:431)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:324)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:72)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:71)
```
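For reference, a setup along the following lines exercises this code path (the metastore version and jar path are placeholders):
```scala
import org.apache.spark.sql.SparkSession

// Illustrative configuration: metastore classes come from the isolated
// classloader, while Hadoop classes stay shared with the application
// classloader.
val spark = SparkSession.builder()
  .appName("isolated-metastore-classloader")
  .config("spark.sql.hive.metastore.version", "2.3.7")
  .config("spark.sql.hive.metastore.jars", "/path/to/hive-2.3.7/lib/*")
  .enableHiveSupport()
  .getOrCreate()

// The first metastore access builds the isolated Hive client and, without
// this fix, can hit the "Illegal Hadoop Version: Unknown" error above.
spark.sql("SHOW DATABASES").show()
```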
Technically, this is an issue in Hadoop's `VersionInfo`, which used to load its
version properties through the thread context classloader; it has been fixed by
https://issues.apache.org/jira/browse/HADOOP-14067, which makes `VersionInfo`
load the properties from its own classloader. But since we still support old
Hadoop versions, we should fix it on the Spark side as well.
Why does this issue start to happen in Spark 3.0.0?
In Spark 2.4.x, we use Hive 1.2.1 by default, which triggers `VersionInfo`
initialization in the static code of the `Hive` class. This happens when we
load the `HiveClientImpl` class, because the `HiveClientImpl.client` method
refers to the `Hive` class. At that moment, the thread context classloader is
not yet the isolated classloader, so `VersionInfo` can find the hadoop-common
jar on the classpath and initialize itself correctly.
In Spark 3.0.0, we use Hive 2.3.7. The static code of the `Hive` class no
longer accesses `VersionInfo` because of the change in
https://issues.apache.org/jira/browse/HIVE-11657. Instead, `VersionInfo` is
accessed when creating a `Hive` object (see the above stack trace), which
happens at
https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L260.
But we switch to the isolated classloader before calling
`HiveClientImpl.client` (see
https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L283).
This is exactly the scenario described above: Hadoop `VersionInfo` is not
initialized before we switch to the isolated classloader, so initializing it
through the isolated classloader (the current thread context classloader)
fails.
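A standalone sketch of that ordering problem (`VersionInfoOrderingRepro` is illustrative, and it assumes a Hadoop version that predates HADOOP-14067, where `VersionInfo` resolves its properties file through the thread context classloader):
```scala
import java.net.{URL, URLClassLoader}

import org.apache.hadoop.util.VersionInfo

object VersionInfoOrderingRepro {
  def main(args: Array[String]): Unit = {
    val appLoader = Thread.currentThread().getContextClassLoader
    // A loader that shares *classes* with the application classloader (like
    // the isolated classloader shares Hadoop classes) but resolves no
    // *resources*, because it has no URLs and no parent.
    val isolatedLike = new URLClassLoader(Array.empty[URL], null) {
      override def loadClass(name: String, resolve: Boolean): Class[_] =
        appLoader.loadClass(name)
    }
    Thread.currentThread().setContextClassLoader(isolatedLike)
    try {
      // VersionInfo's static initializer runs now; pre-HADOOP-14067 it looks
      // up common-version-info.properties via the context classloader, finds
      // nothing, and caches "Unknown" -- the value Hive's ShimLoader rejects.
      println(VersionInfo.getVersion)
    } finally {
      Thread.currentThread().setContextClassLoader(appLoader)
    }
  }
}
```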
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The new regression test added in this PR.