Udit Mehrotra created HUDI-268:
----------------------------------

             Summary: Allow parquet/avro versions upgrading in Hudi
                 Key: HUDI-268
                 URL: https://issues.apache.org/jira/browse/HUDI-268
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Hive Integration, Presto Integration, Usability
            Reporter: Udit Mehrotra


As of now Hudi depends on *Parquet 1.8.1* and *Avro 1.7.7*, which work 
fine with older versions of Spark and Hive.

But when we build it against *Spark 2.4.3*, which uses *Parquet 1.10.1* and 
*Avro 1.8.2*, using:
{code:java}
mvn clean install -DskipTests -DskipITs -Dhadoop.version=2.8.5-amzn-4 
-Dspark.version=2.4.3 -Dhbase.version=1.4.10 -Dhive.version=2.3.5 
-Dparquet.version=1.10.1 -Davro.version=1.8.2
{code}
we run into a runtime issue on *Hive 2.3.5* when querying RT tables:
{noformat}
hive> select record_key, cs_wholesale_cost, cs_ext_sales_price from 
catalog_sales_part_mor_sep20_01_rt limit 10;
OK
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/avro/LogicalType
        at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:323)
        at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:105)
        at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48)
        at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
        at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
        at 
org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:234)
        at 
org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
        at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
        at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
        at 
org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
        at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
        at 
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357){noformat}
This happens because we shade *parquet-avro*, which is now *1.10.1* and 
requires *Avro 1.8.2*, where the *LogicalType* class is available. However, 
*Hive 2.3.5* has *Avro 1.7.7* on its runtime classpath, which does not 
include the *LogicalType* class.
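
A quick way to confirm which Avro classes are actually visible at runtime is a 
throwaway diagnostic like the one below (a sketch, not Hudi code; the class 
name {{ClasspathCheck}} is made up). Dropping it on Hive's classpath and 
running it shows whether *LogicalType* resolves and, if so, from which jar:
{code:java}
// Diagnostic sketch (hypothetical, not part of Hudi): reports which jar a
// class is loaded from, to confirm whether org.apache.avro.LogicalType is
// visible on the runtime classpath Hive hands to the record reader.
public class ClasspathCheck {

    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className);
            java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
            // Bootstrap classes have no CodeSource; everything else reports its jar/dir.
            return src != null ? src.getLocation().toString() : "bootstrap classpath";
        } catch (ClassNotFoundException e) {
            return "NOT FOUND";
        }
    }

    public static void main(String[] args) {
        // The class whose absence triggers the NoClassDefFoundError above.
        System.out.println("org.apache.avro.LogicalType -> "
                + locate("org.apache.avro.LogicalType"));
    }
}
{code}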

To avoid these scenarios, and at least allow use of newer Spark versions 
without breaking the Hive/Presto integrations, we propose the following:
 * Compile Hudi with the Parquet/Avro versions used by Spark.
 * Shade Avro in *hadoop-mr-bundle* and *presto-bundle*, to avoid issues 
caused by the older Avro version available there.

This will also help with our other efforts, where we want to upgrade to Spark 
2.4 and deprecate the use of databricks-avro. Thoughts?
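
For the second bullet, the shading could look roughly like the following 
*maven-shade-plugin* relocation in the bundle pom. This is only a sketch: the 
relocation prefix {{org.apache.hudi.org.apache.avro}} and the exact plugin 
configuration are assumptions that would need to match the real bundle 
modules.
{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Bundle our own Avro 1.8.x and move it under a Hudi-specific
               package, so it cannot clash with the Avro 1.7.7 that Hive
               ships at runtime. -->
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>org.apache.hudi.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}
With this in place, the bundles carry their own relocated copy of Avro, and 
Hive's older Avro on the classpath is no longer consulted for the shaded code 
paths.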

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
