[GitHub] [hudi] cdmikechen opened a new issue #2005: [SUPPORT] hudi hive-sync in master branch (0.6.1) can not run by spark with local

GitBox Thu, 20 Aug 2020 23:00:54 -0700


cdmikechen opened a new issue #2005:
URL: https://github.com/apache/hudi/issues/2005



   **Describe the problem you faced**
   
   On the master branch (0.6.1), Hudi cannot use `hive-sync` to sync a table to Hive. The sync fails with this error:
   ```
   Caused by: java.lang.ClassNotFoundException: parquet.hadoop.ParquetInputFormat
   ```
   
   Steps to reproduce the behavior:
   
   1. Run a `HoodieDeltaStreamer` job with Spark master `local[2]` and sync the Hudi table to Hive.
   2. During the sync to Hive, it reports this error:
   ```
   java.lang.NoClassDefFoundError: parquet/hadoop/ParquetInputFormat
        at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.<init>(MapredParquetInputFormat.java:46) ~[hive-exec-1.2.1.spark2.jar:1.2.1.spark2]
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.<init>(HoodieParquetInputFormat.java:67) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.getInputFormat(HoodieInputFormatUtils.java:82) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.getInputFormatClassName(HoodieInputFormatUtils.java:92) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:159) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:130) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:98) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:510) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:425) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:244) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:161) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:159) ~[hudi-utilities-bundle_2.11-0.6.1-SNAPSHOT.jar:0.6.1-SNAPSHOT]
        ***
           ***
   Caused by: java.lang.ClassNotFoundException: parquet.hadoop.ParquetInputFormat
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[na:1.8.0_251]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[na:1.8.0_251]
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) ~[na:1.8.0_251]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[na:1.8.0_251]
        ... 20 common frames omitted
   ```
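
To confirm which Parquet package is actually visible to the driver, a small probe like the one below may help. It is only a diagnostic sketch: `ClasspathCheck` and `isOnClasspath` are hypothetical names, and the second class name is my assumption about where newer Parquet releases place the same class (under `org.apache.parquet`), which is not something the trace itself shows.

```java
// Diagnostic sketch (hypothetical helper, not part of Hudi): reports whether
// a class can be loaded from the current classpath without initializing it.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate and load the class, run no static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            // package name that hive-exec-1.2.1.spark2 asks for in the trace above
            "parquet.hadoop.ParquetInputFormat",
            // assumption: the relocated name used by newer Parquet releases
            "org.apache.parquet.hadoop.ParquetInputFormat"
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + isOnClasspath(name));
        }
    }
}
```

Running it with the same classpath as the `HoodieDeltaStreamer` job should show whether the old `parquet.hadoop` package is missing.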
   
   **Environment Description**
   
   * Hudi version : 0.6.1
   
   * Spark version : 2.4.3
   
   * Hive version : 2.3.3
   
   * Hadoop version : 2.8.5
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I checked the code where the error occurs:
   ```java
     public static FileInputFormat getInputFormat(HoodieFileFormat baseFileFormat, boolean realtime, Configuration conf) {
       switch (baseFileFormat) {
         case PARQUET:
           if (realtime) {
             HoodieParquetRealtimeInputFormat inputFormat = new HoodieParquetRealtimeInputFormat();
             inputFormat.setConf(conf);
             return inputFormat;
           } else {
             HoodieParquetInputFormat inputFormat = new HoodieParquetInputFormat();
             inputFormat.setConf(conf);
             return inputFormat;
           }
         default:
           throw new HoodieIOException("Hoodie InputFormat not implemented for base file format " + baseFileFormat);
       }
     }
   
     public static String getInputFormatClassName(HoodieFileFormat baseFileFormat, boolean realtime, Configuration conf) {
       FileInputFormat inputFormat = getInputFormat(baseFileFormat, realtime, conf);
       return inputFormat.getClass().getName();
     }
   ```
   I think instantiating a `ParquetInputFormat` (via `new HoodieParquetInputFormat()`) is not a good idea when Hudi runs in Spark. The `hive-sync` package only needs the class name of a `FileInputFormat`, so there is no need to create an object just to read its class name. Moreover, Spark does not ship the full set of Hive jars, so it cannot do everything Hive itself can.
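
As a rough illustration of the suggestion above, the class name could be resolved without constructing anything. This is only a sketch under assumptions, not the actual Hudi code: `InputFormatNames` and the nested `BaseFileFormat` enum are hypothetical stand-ins (for `HoodieInputFormatUtils` and `HoodieFileFormat`), the package of the realtime input format is assumed, and `IllegalArgumentException` stands in for `HoodieIOException` so the example compiles on its own.

```java
// Sketch only: returns the input format class NAME without instantiating it,
// so no Hive/Parquet constructor runs during hive-sync.
final class InputFormatNames {

    // Hypothetical stand-in for org.apache.hudi HoodieFileFormat.
    enum BaseFileFormat { PARQUET, OTHER }

    // Class name taken from the stack trace in this issue.
    static final String PARQUET_INPUT_FORMAT =
        "org.apache.hudi.hadoop.HoodieParquetInputFormat";

    // Assumption: the realtime variant lives in the "realtime" subpackage.
    static final String PARQUET_REALTIME_INPUT_FORMAT =
        "org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat";

    static String getInputFormatClassName(BaseFileFormat baseFileFormat, boolean realtime) {
        switch (baseFileFormat) {
            case PARQUET:
                return realtime ? PARQUET_REALTIME_INPUT_FORMAT : PARQUET_INPUT_FORMAT;
            default:
                // Stand-in for HoodieIOException in the original code.
                throw new IllegalArgumentException(
                    "Hoodie InputFormat not implemented for base file format " + baseFileFormat);
        }
    }

    public static void main(String[] args) {
        System.out.println(getInputFormatClassName(BaseFileFormat.PARQUET, false));
        System.out.println(getInputFormatClassName(BaseFileFormat.PARQUET, true));
    }
}
```

Inside Hudi itself, a class literal such as `HoodieParquetInputFormat.class.getName()` would presumably serve the same purpose, since it does not run the constructor that triggers the `NoClassDefFoundError`.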
   
   
   
   

