umehrot2 edited a comment on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 
2.4.4, migrate to spark-avro library instead of databricks-avro, add support 
for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554089712
 
 
   @modi95 @bvaradar 
   
   I was able to fix the integration test dependency issues locally, at least. Hoping that things run fine on Travis too. To give an overview, there were 3 major failures happening:
   
   1. The `ITTestHoodieSanity` tests were failing, firstly because of this error:
   ```
   17:15:31.995 [pool-21-thread-2] ERROR org.apache.hudi.io.HoodieCreateHandle - Error writing record HoodieRecord{key=HoodieKey { recordKey=98ea14b7-b318-4b0b-9f14-0115900a10e0 partitionPath=2016/03/15}, currentLocation='null', newLocation='null'}
   java.lang.NoSuchMethodError: org.apache.parquet.io.api.Binary.fromCharSequence(Ljava/lang/CharSequence;)Lorg/apache/parquet/io/api/Binary;
        at org.apache.parquet.avro.AvroWriteSupport.fromAvroString(AvroWriteSupport.java:371) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:346) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121) ~[hive-exec-2.3.1.jar:1.10.1]
        at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:288) ~[hive-exec-2.3.1.jar:1.10.1]
        at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvroWithMetadata(HoodieParquetWriter.java:91) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.hudi.io.HoodieCreateHandle.write(HoodieCreateHandle.java:101) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:150) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:142) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:125) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
        at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:38) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   ```
   
   This happens because in Hudi, even for the bits running through Spark, we are using `Hive 2.3.1`, which is not really compatible with Spark. So `hive-exec 2.3.1` ends up on the `HoodieJavaApp` classpath while running the example, and it has its own shaded Parquet version, which is old and conflicts with `parquet 1.10.1`.
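   To track down this kind of conflict, a small probe that prints which jar actually supplied a class can help; a minimal sketch (the class name on the command line is just an example):
   ```java
   // ClasspathProbe.java: print which jar (if any) a class was loaded from,
   // to diagnose conflicts like the shaded-parquet one above.
   public class ClasspathProbe {
       static String locate(Class<?> cls) {
           java.security.CodeSource src = cls.getProtectionDomain().getCodeSource();
           // JDK bootstrap classes have no CodeSource
           return src == null ? "bootstrap" : src.getLocation().toString();
       }

       public static void main(String[] args) throws Exception {
           // e.g. java ClasspathProbe org.apache.parquet.io.api.Binary
           String name = args.length > 0 ? args[0] : "java.lang.String";
           System.out.println(name + " -> " + locate(Class.forName(name)));
       }
   }
   ```
   Running it with `org.apache.parquet.io.api.Binary` on the failing classpath would show whether the class comes from the Hudi bundle or from the `hive-exec` shaded copy.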
   
   What I propose here is that we use a version of Hive that is compatible with Spark, at least for the bits running inside Spark, so that compatible versions of Hive end up on the classpath. `hive-exec 1.2.1.spark2` does not cause this issue, as it does not shade Parquet. Also, we have removed Hive shading on master now, so we are in any case dependent on the runtime Hive version, which is Spark's Hive version. So from the code's perspective it also makes sense for the code running inside Spark to depend on Spark's Hive version, to avoid such issues.
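   As a rough sketch of what that pin could look like in a Spark-facing module's `pom.xml` (the property name here is hypothetical; the coordinates are Spark's Hive fork):
   ```xml
   <!-- Sketch: depend on Spark's Hive fork instead of hive-exec 2.3.1 -->
   <properties>
     <spark.hive.version>1.2.1.spark2</spark.hive.version>
   </properties>
   <dependencies>
     <dependency>
       <groupId>org.spark-project.hive</groupId>
       <artifactId>hive-exec</artifactId>
       <version>${spark.hive.version}</version>
     </dependency>
   </dependencies>
   ```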
   
   2. After that, all the `_rt` tests in `ITTestHoodieSanity` were failing: now that our code uses `Avro 1.8.2` while Hive is still on older versions, we need to shade Avro in `hudi-hadoop-mr-bundle`, which we had done internally for EMR through an optional profile. Now that we are migrating Hudi itself to Avro 1.8.2, we need to always shade Avro to get around this issue. More details on https://issues.apache.org/jira/browse/HUDI-268
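   For reference, shading Avro in the bundle would be a `maven-shade-plugin` relocation along these lines (the shaded package prefix is illustrative, not necessarily what the bundle uses):
   ```xml
   <!-- Sketch: relocate Avro inside hudi-hadoop-mr-bundle so it cannot
        clash with the older Avro on Hive's classpath -->
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <configuration>
       <relocations>
         <relocation>
           <pattern>org.apache.avro.</pattern>
           <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
         </relocation>
       </relocations>
     </configuration>
   </plugin>
   ```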
   
   3. Finally, some tests were failing because `spark-avro` was not being passed while starting the spark-shell, so it was not finding the classes. I switched over to downloading `spark-avro` instead of `databricks-avro`.
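   Concretely, that means pulling the Apache package rather than the Databricks one when launching the shell; something like (versions shown are for Spark 2.4.4 / Scala 2.11 and may need adjusting):
   ```
   # new: Apache spark-avro, versioned with Spark itself
   spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.4

   # old: the Databricks package
   spark-shell --packages com.databricks:spark-avro_2.11:4.0.0
   ```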
   
   With the above changes, the integration tests pass now. Let me know your thoughts on these changes, and whether there are any concerns.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
