cdmikechen opened a new issue #2005:
URL: https://github.com/apache/hudi/issues/2005
**Describe the problem you faced**
Hudi on the master branch (0.6.1) cannot use `hive-sync` to sync to Hive. It fails with this error:
```
Caused by: java.lang.ClassNotFoundException: parquet.hadoop.ParquetInputFormat
```
Steps to reproduce the behavior:
1. Run a `HoodieDeltaStreamer` task with master `local[2]` and sync the Hudi table to Hive.
2. During the sync to Hive, it reports this error:
```
java.lang.NoClassDefFoundError: parquet/hadoop/ParquetInputFormat
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.<init>(MapredParquetInputFormat.java:46) ~[hive-exec-1.2.1.spark2.jar:1.2.1.spark2]
    at org.apache.hudi.hadoop.HoodieParquetInputFormat.<init>(HoodieParquetInputFormat.java:67) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.getInputFormat(HoodieInputFormatUtils.java:82) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.getInputFormatClassName(HoodieInputFormatUtils.java:92) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:159) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:130) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:98) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:510) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:425) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:244) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:161) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:159) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
***
***
Caused by: java.lang.ClassNotFoundException: parquet.hadoop.ParquetInputFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[na:1.8.0_251]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[na:1.8.0_251]
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) ~[na:1.8.0_251]
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[na:1.8.0_251]
    ... 20 common frames omitted
```
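To narrow down a failure like the one in the trace, it may help to check which Parquet package names are visible on the driver classpath. The trace shows `hive-exec-1.2.1.spark2` looking for `parquet.hadoop.ParquetInputFormat`. As far as I know, newer Parquet releases ship the same class under `org.apache.parquet.hadoop`. The sketch below is only a diagnostic idea: `ClasspathProbe` and `isLoadable` are hypothetical names, not part of Hudi, Hive, or Spark.

```java
public class ClasspathProbe {

    // Returns true if the named class can be loaded by this class's loader.
    // initialize=false avoids running static initializers while probing.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "parquet.hadoop.ParquetInputFormat",            // name requested in the stack trace
            "org.apache.parquet.hadoop.ParquetInputFormat"  // assumed newer package name
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "missing"));
        }
    }
}
```

Running it with the same classpath as the `HoodieDeltaStreamer` job would show whether only the newer package name is present.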
**Environment Description**
* Hudi version : 0.6.1
* Spark version : 2.4.3
* Hive version : 2.3.3
* Hadoop version : 2.8.5
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
**Additional context**
I checked the error code:
```java
public static FileInputFormat getInputFormat(HoodieFileFormat baseFileFormat, boolean realtime, Configuration conf) {
  switch (baseFileFormat) {
    case PARQUET:
      if (realtime) {
        HoodieParquetRealtimeInputFormat inputFormat = new HoodieParquetRealtimeInputFormat();
        inputFormat.setConf(conf);
        return inputFormat;
      } else {
        HoodieParquetInputFormat inputFormat = new HoodieParquetInputFormat();
        inputFormat.setConf(conf);
        return inputFormat;
      }
    default:
      throw new HoodieIOException("Hoodie InputFormat not implemented for base file format " + baseFileFormat);
  }
}

public static String getInputFormatClassName(HoodieFileFormat baseFileFormat, boolean realtime, Configuration conf) {
  FileInputFormat inputFormat = getInputFormat(baseFileFormat, realtime, conf);
  return inputFormat.getClass().getName();
}
```
I think instantiating a `ParquetInputFormat` here is not a good idea for Hudi running in Spark. The `hive-sync` package only needs the `FileInputFormat` class name, so there is no need to create an object just to read its class name. Spark also does not ship the full set of Hive jars, so it cannot do everything Hive itself can.
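The suggestion can be illustrated with a minimal sketch that reads the name from a class literal, so no constructor runs. This is not Hudi's actual code: every class below is a hypothetical stand-in, and the base constructor throws on purpose to mimic the `NoClassDefFoundError` raised in `MapredParquetInputFormat.<init>`.

```java
public class InputFormatNameSketch {

    // Hypothetical stand-in for HoodieFileFormat.
    enum FileFormat { PARQUET }

    // Stand-in for a Hive input format whose constructor needs a missing class.
    static class BaseInputFormat {
        BaseInputFormat() {
            throw new NoClassDefFoundError("parquet/hadoop/ParquetInputFormat");
        }
    }

    // Stand-ins for the two Hudi input formats.
    static class SketchParquetInputFormat extends BaseInputFormat { }

    static class SketchParquetRealtimeInputFormat extends SketchParquetInputFormat { }

    // Resolves the class name from a class literal; nothing is instantiated,
    // so the failing constructor above is never invoked.
    static String getInputFormatClassName(FileFormat baseFileFormat, boolean realtime) {
        switch (baseFileFormat) {
            case PARQUET:
                return realtime
                    ? SketchParquetRealtimeInputFormat.class.getName()
                    : SketchParquetInputFormat.class.getName();
            default:
                throw new IllegalArgumentException(
                    "InputFormat not implemented for base file format " + baseFileFormat);
        }
    }

    public static void main(String[] args) {
        System.out.println(getInputFormatClassName(FileFormat.PARQUET, false));
        System.out.println(getInputFormatClassName(FileFormat.PARQUET, true));
    }
}
```

Calling `new SketchParquetInputFormat()` in this sketch would throw, while the name lookup succeeds. That is the behavior `hive-sync` would need if it only requires the class name.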