wangyum commented on a change in pull request #23788: [SPARK-27176][SQL] Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4
URL: https://github.com/apache/spark/pull/23788#discussion_r271147574
##########
File path: pom.xml
##########
@@ -2656,7 +2718,24 @@
<hadoop.version>3.2.0</hadoop.version>
<curator.version>2.13.0</curator.version>
<zookeeper.version>3.4.13</zookeeper.version>
+ <hive.group>org.apache.hive</hive.group>
+ <hive.classifier>core</hive.classifier>
+ <hive.version>2.3.4</hive.version>
+ <hive.version.short>${hive.version}</hive.version.short>
+ <hive.extra.deps.scope>${hive.deps.scope}</hive.extra.deps.scope>
+ <hive.parquet.version>${parquet.version}</hive.parquet.version>
+ <orc.classifier></orc.classifier>
+ <hive.parquet.group>org.apache.parquet</hive.parquet.group>
+ <datanucleus-core.version>4.1.17</datanucleus-core.version>
</properties>
+ <dependencies>
+ <!-- Both ORC and Parquet need hive-storage-api, but it is excluded by orc-mapreduce -->
+ <dependency>
+ <groupId>org.apache.hive</groupId>
+ <artifactId>hive-storage-api</artifactId>
+ <version>2.6.0</version>
Review comment:
No, both `Hive` and `ORC` need `hive-storage-api` (a quick classpath check follows the two examples below):
1. Remove `hive-storage-api` and save as table:
```scala
scala> spark.range(10).write.saveAsTable("test2")
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
  at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:85)
  at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDF(Registry.java:177)
  at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDF(Registry.java:170)
  at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:209)
  at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:247)
  at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
  at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:388)
  at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:332)
  at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:312)
  at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288)
  at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:258)
  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:280)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:270)
  at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:361)
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217)
  at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:139)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:129)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:40)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:55)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:90)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:90)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:420)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:446)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:441)
  ... 47 elided
Caused by: java.lang.reflect.InvocationTargetException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/io/HiveDecimalWritable
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:83)
  ... 75 more
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/io/HiveDecimalWritable
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloorCeilBase.<init>(GenericUDFFloorCeilBase.java:48)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor.<init>(GenericUDFFloor.java:41)
  ... 80 more
```
2. Remove `hive-storage-api` and write to ORC:
```scala
scala> spark.range(10).write.orc("test3")
19/04/01 21:47:40 WARN DAGScheduler: Broadcasting large task binary with size 172.4 KiB
19/04/01 21:47:41 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/exec/vector/ColumnVector
  at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.createOrcValue(OrcSerializer.scala:226)
  at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
  at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:37)
  at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:120)
  at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
  at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:109)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$14(FileFormatWriter.scala:177)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:428)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1321)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:431)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.exec.vector.ColumnVector
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 16 more
```
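Both missing classes live in `hive-storage-api`: `HiveDecimalWritable` is needed by Hive's UDF registry, `ColumnVector` by Spark's ORC writer. A minimal sketch of the classpath check, assuming a `spark-shell` session with the `hive-storage-api` 2.6.0 dependency from the diff above restored (the printed jar path depends on the local environment):
```scala
// Resolve each class and print the jar it was loaded from; with
// hive-storage-api on the classpath, both should point at the same jar.
Seq(
  "org.apache.hadoop.hive.serde2.io.HiveDecimalWritable",  // Hive UDF registry
  "org.apache.hadoop.hive.ql.exec.vector.ColumnVector"     // ORC writer
).foreach { name =>
  val source = Class.forName(name).getProtectionDomain.getCodeSource
  println(s"$name -> ${source.getLocation}")
}
```
With the dependency in place, both lines should print a path ending in `hive-storage-api-2.6.0.jar`, which is why excluding it breaks the Hive and ORC paths alike.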