[ https://issues.apache.org/jira/browse/HUDI-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Udit Mehrotra updated HUDI-268:
-------------------------------
    Description: 
As of now, Hudi depends on *Parquet 1.8.1* and *Avro 1.7.7*, which work fine with older versions of Spark and Hive.
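
For context, here is a minimal sketch of how these defaults are wired up as version properties in the root POM (the property names are inferred from the -D overrides in the build command below):
{code:xml}
<!-- Sketch only: default dependency versions as Maven properties,
     overridable on the command line via -Dparquet.version=... etc. -->
<properties>
  <parquet.version>1.8.1</parquet.version>
  <avro.version>1.7.7</avro.version>
</properties>
{code}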

But when we build it with *Spark 2.4.3*, which uses *Parquet 1.10.1* and *Avro 1.8.2*, using:
{code:java}
mvn clean install -DskipTests -DskipITs \
  -Dhadoop.version=2.8.5-amzn-4 -Dspark.version=2.4.3 \
  -Dhbase.version=1.4.10 -Dhive.version=2.3.5 \
  -Dparquet.version=1.10.1 -Davro.version=1.8.2
{code}
we run into a runtime issue on *Hive 2.3.5* when querying RT (real-time) tables:
{noformat}
hive> select record_key from mytable_mor_sep20_01_rt limit 10;
OK
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
        at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:323)
        at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:105)
        at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48)
        at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
        at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
        at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:234)
        at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357){noformat}
This happens because we shade *parquet-avro*, which is now *1.10.1* and requires *Avro 1.8.2*, the release line in which the *LogicalType* class was introduced (Avro 1.8.0). However, *Hive 2.3.5* provides *Avro 1.7.7* at runtime, which does not have the *LogicalType* class.
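
As a quick diagnostic, here is a minimal, self-contained Java sketch (not part of Hudi) that checks whether the Avro visible on a given classpath is new enough, i.e. whether *LogicalType* is present:
{code:java}
import java.security.CodeSource;

// Diagnostic sketch: run with the same classpath as Hive to see which Avro
// generation is visible. LogicalType first appeared in Avro 1.8.0.
public class AvroLogicalTypeCheck {
    public static void main(String[] args) {
        try {
            Class<?> c = Class.forName("org.apache.avro.LogicalType");
            CodeSource src = c.getProtectionDomain().getCodeSource();
            System.out.println("Avro >= 1.8 visible, LogicalType loaded from: "
                    + (src != null ? src.getLocation() : "<bootstrap>"));
        } catch (ClassNotFoundException e) {
            System.out.println("Avro < 1.8 visible: org.apache.avro.LogicalType is missing, "
                    + "so parquet-avro 1.10.x will fail with NoClassDefFoundError");
        }
    }
}
{code}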

To avoid these scenarios, and at least allow usage of higher versions of Spark without affecting the Hive/Presto integrations, we propose the following:
 * Always compile Hudi with the Parquet/Avro version used by Spark, without worrying about Hive/Presto.
 * Shade and relocate Avro in *hadoop-mr-bundle* and *presto-bundle* to avoid issues due to the older version of Avro available there (see the relocation sketch after this list).
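
For the second bullet, a minimal sketch of what the relocation could look like in the bundle POMs, using the maven-shade-plugin (the shaded package prefix *org.apache.hudi.* is an assumption for illustration, not a final choice):
{code:xml}
<!-- Illustrative sketch only: relocate the bundled Avro so the bundle's
     parquet-avro uses Avro 1.8.2 while Hive keeps using its own Avro 1.7.7.
     Assumes avro is already included in the bundle's shaded artifact set. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.avro</pattern>
            <!-- hypothetical prefix; any package outside org.apache.avro works -->
            <shadedPattern>org.apache.hudi.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}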

This will also help with our other efforts, where we want to upgrade to Spark 2.4 and deprecate the use of databricks-avro. Thoughts?

 



> Allow parquet/avro versions upgrading in Hudi
> ---------------------------------------------
>
>                 Key: HUDI-268
>                 URL: https://issues.apache.org/jira/browse/HUDI-268
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Hive Integration, Presto Integration, Usability
>            Reporter: Udit Mehrotra
>            Priority: Major
>


