[ https://issues.apache.org/jira/browse/HUDI-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Udit Mehrotra updated HUDI-268:
-------------------------------
    Component/s:     (was: Presto Integration)

> Allow parquet/avro versions upgrading in Hudi
> ---------------------------------------------
>
>                 Key: HUDI-268
>                 URL: https://issues.apache.org/jira/browse/HUDI-268
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Hive Integration, Usability
>            Reporter: Udit Mehrotra
>            Assignee: Udit Mehrotra
>            Priority: Major
>
> Hudi currently depends on *Parquet 1.8.1* and *Avro 1.7.7*, which work fine with older versions of Spark and Hive.
> However, when we build it against *Spark 2.4.3*, which uses *Parquet 1.10.1* and *Avro 1.8.2*, using:
> {code:java}
> mvn clean install -DskipTests -DskipITs -Dhadoop.version=2.8.5 \
>     -Dspark.version=2.4.3 -Dhbase.version=1.4.10 -Dhive.version=2.3.5 \
>     -Dparquet.version=1.10.1 -Davro.version=1.8.2
> {code}
> we run into a runtime failure on *Hive 2.3.5* when querying RT tables:
> {noformat}
> hive> select record_key from mytable_mor_sep20_01_rt limit 10;
> OK
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
> 	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:323)
> 	at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:105)
> 	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48)
> 	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
> 	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
> 	at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:234)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
> 	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
> 	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
> 	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
> 	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
> 	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
> 	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
> 	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
> 	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> 	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> {noformat}
> This happens because we shade *parquet-avro*, which is now *1.10.1* and requires *Avro 1.8.2*, the version that introduced the *LogicalType* class. However, *Hive 2.3.5* provides *Avro 1.7.7* at runtime, which does not have the *LogicalType* class.
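The root cause above comes down to which Avro version the runtime classloader resolves. A minimal probe along these lines (a hypothetical diagnostic, not part of the issue or the Hudi codebase) can confirm whether the classpath carries an Avro new enough to have *LogicalType*:

```java
// Hypothetical diagnostic: report whether a class is resolvable on the
// current classpath. org.apache.avro.LogicalType first appeared in Avro 1.8,
// so on Hive 2.3.5's bundled Avro 1.7.7 this lookup fails -- which is the
// ClassNotFoundException seen in the stack trace above.
public class AvroClasspathProbe {

    // Returns true if the named class can be located (without initializing it).
    static boolean hasClass(String name) {
        try {
            Class.forName(name, false, AvroClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Avro >= 1.8 on classpath: "
                + hasClass("org.apache.avro.LogicalType"));
    }
}
```

Running this on the Hive CLI's classpath versus Spark's makes the mismatch visible without having to trigger the query failure.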
> To avoid these scenarios, and to at least allow use of newer Spark versions without affecting the Hive integration, we propose the following:
> * Always compile Hudi with the Parquet/Avro versions used by Spark.
> * Shade Avro in *hadoop-mr-bundle* to avoid issues caused by the older Avro version available at Hive runtime.
> This will also help with our other efforts, where we want to upgrade to Spark 2.4 and deprecate the use of databricks-avro. Thoughts?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
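The shading proposed for *hadoop-mr-bundle* could be done with a maven-shade-plugin relocation along these lines (an illustrative sketch only: the `shadedPattern` prefix is hypothetical and not taken from the actual Hudi poms):

```xml
<!-- Illustrative sketch: relocate the bundled Avro 1.8.2 classes so they
     cannot collide with the Avro 1.7.7 that Hive 2.3.5 puts on the runtime
     classpath. The shadedPattern prefix here is hypothetical. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>org.apache.hudi.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With such a relocation, the Avro classes inside the bundle (and all references to them in the bundled Hudi classes) are rewritten under the prefixed package, so Hudi's record readers resolve their own Avro regardless of which version Hive provides.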