Udit Mehrotra created HUDI-268:
----------------------------------
Summary: Allow parquet/avro versions upgrading in Hudi
Key: HUDI-268
URL: https://issues.apache.org/jira/browse/HUDI-268
Project: Apache Hudi (incubating)
Issue Type: Improvement
Components: Hive Integration, Presto Integration, Usability
Reporter: Udit Mehrotra
As of now, Hudi depends on *Parquet* *1.8.1* and *Avro* *1.7.7*, which might work
fine for older versions of Spark and Hive.
But when we build it against *Spark* *2.4.3*, which uses *Parquet 1.10.1* and
*Avro 1.8.2*, using:
{code:java}
mvn clean install -DskipTests -DskipITs -Dhadoop.version=2.8.5-amzn-4 \
  -Dspark.version=2.4.3 -Dhbase.version=1.4.10 -Dhive.version=2.3.5 \
  -Dparquet.version=1.10.1 -Davro.version=1.8.2
{code}
We run into a runtime issue on *Hive 2.3.5* when querying RT tables:
{noformat}
hive> select record_key, cs_wholesale_cost, cs_ext_sales_price from catalog_sales_part_mor_sep20_01_rt limit 10;
OK
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:323)
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:105)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:234)
at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357){noformat}
This happens because we shade *parquet-avro*, which is now at *1.10.1* and
requires *Avro 1.8.2*, the version that provides the *LogicalType* class. However,
*Hive 2.3.5* has *Avro 1.7.7* available at runtime, which does not have the
*LogicalType* class.
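One quick way to confirm this kind of mismatch is to probe the runtime classpath for the missing class. The snippet below is only a minimal sketch: `AvroClasspathCheck` and `isClassPresent` are hypothetical names, not part of Hudi or Hive, and the result depends entirely on which Avro jar the JVM is started with.

```java
// Hypothetical helper, not part of Hudi: reports whether a class can be
// loaded from the current classpath, without initializing it.
final class AvroClasspathCheck {

    static boolean isClassPresent(String className) {
        try {
            // initialize=false: only resolve the class, do not run static initializers.
            Class.forName(className, false, AvroClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is absent, or present but not loadable in this environment.
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the stack trace above.
        String logicalType = "org.apache.avro.LogicalType";
        System.out.println(logicalType + " present: " + isClassPresent(logicalType));
    }
}
```

Running it with the same classpath that Hive uses should indicate whether the Avro version visible at runtime contains *LogicalType*.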
To avoid these scenarios, and at least allow usage of higher versions of Spark
without affecting Hive/Presto integrations, we propose the following:
* Compile Hudi with the Parquet/Avro version used by Spark.
* Shade Avro in *hadoop-mr-bundle* and *presto-bundle* to avoid issues due to
an older version of Avro being available there.
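To illustrate what shading Avro would mean for class names, here is a simplified model of package relocation. It is a sketch only: `RelocationSketch` is a hypothetical class, the `org.apache.hudi.` prefix is an assumed shaded package that the proposal above does not specify, and real shading tools rewrite bytecode references rather than strings.

```java
// Simplified, hypothetical model of package relocation during shading.
final class RelocationSketch {

    // Returns the relocated name if className lives under `pattern`,
    // otherwise returns className unchanged.
    static String relocate(String className, String pattern, String shadedPattern) {
        if (className.equals(pattern) || className.startsWith(pattern + ".")) {
            return shadedPattern + className.substring(pattern.length());
        }
        return className;
    }

    public static void main(String[] args) {
        String pattern = "org.apache.avro";
        // Assumed shaded prefix, chosen for illustration only.
        String shaded = "org.apache.hudi.org.apache.avro";

        // The bundled Avro classes get a new name, so they cannot collide
        // with the older Avro that Hive puts on the classpath.
        System.out.println(relocate("org.apache.avro.LogicalType", pattern, shaded));
        // Classes outside the pattern are left alone.
        System.out.println(relocate("org.apache.hadoop.hive.ql.Driver", pattern, shaded));
    }
}
```

Because the bundle would then reference its own relocated copy of Avro, the Avro 1.7.7 jar shipped with Hive would no longer be consulted for those classes.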
This will also help with our other issues, where we want to upgrade to Spark 2.4
and also deprecate the use of databricks-avro. Thoughts?