[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-573235373 @bvaradar Created a JIRA to track documentation of Avro shading caveat https://issues.apache.org/jira/browse/HUDI-519 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-572232472

@vinothchandar @bvaradar @n3nash I have updated the PR to use `hive-exec` with the `core` classifier, which solves the unit test issues that were occurring because of Hive. I also removed the usage of `spark.hive.version`, as requested. Let me know if this looks good.
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-572199147

Ack, working with that deadline in mind.
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-571730182

> @umehrot2 are you still driving this? We would like to merge this asap, giving us enough time for the next release to be cut..
>
> cc @leesf

@vinothchandar I just got back from my time off a couple of days ago. Let me catch up on this PR and try to get it merged soon. When are we targeting the next release to be cut?
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-555259474

@vinothchandar Yes, updating the quick start page makes sense. Will you guys be doing that?
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-555259043

> @umehrot2 : Sorry for the back-and-forth on this. Issue 1 (as mentioned in [#1005 (comment)](https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554089712)) is due to fat jar hive-exec. @n3nash proposed a solution in Uber which won't require moving to spark-hive. Instead of the test dependency : hive-exec, can you try depending on the non-fat version of the jar called : hive-exec-core. Hopefully, we can control parquet/avro versions getting loaded for the tests.

@bvaradar That's fine, we should take the time and solve this the right way. `hive-exec` is a test dependency in `hudi-utilities`, but not in `hudi-spark`. `hudi-spark` depends on `hive-service`, which brings in `hive-exec` as a transitive dependency, and that is why it ends up on the test classpath. We would have to get rid of `hive-service` if we were to do that. Also, I don't see any artifact named `hive-exec-core`; Maven does not recognize it.

Besides that, when we depend on Spark's version of Hive at runtime, is there any reason why we want to build with Hive 2.x instead? Maybe it makes sense to build `hudi-hive` with Hive 2.x, since that is an independent module. But for anything running within Spark, why are we inclined to build with a version of Hive that Spark does not support?
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554169864

> > @bvaradar can we re-trigger the tests ? I think this time it failed due to flaky timeouts
>
> @umehrot2 : I re-triggered anyways but looks like the log length reached max limits

@bvaradar you are right, it failed again because of the log length. Any suggestions?
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554148996

@bvaradar can we re-trigger the tests? I think this time it failed due to flaky timeouts.
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554089712

@modi95 @bvaradar I was able to fix the integration test dependency issues locally, at least. Hoping that things run fine on Travis too. To give an overview, there were 3 major failures happening:

1. The `ITTestHoodieSanity` tests were failing, first because of this error:

```
17:15:31.995 [pool-21-thread-2] ERROR org.apache.hudi.io.HoodieCreateHandle - Error writing record HoodieRecord{key=HoodieKey { recordKey=98ea14b7-b318-4b0b-9f14-0115900a10e0 partitionPath=2016/03/15}, currentLocation='null', newLocation='null'}
java.lang.NoSuchMethodError: org.apache.parquet.io.api.Binary.fromCharSequence(Ljava/lang/CharSequence;)Lorg/apache/parquet/io/api/Binary;
    at org.apache.parquet.avro.AvroWriteSupport.fromAvroString(AvroWriteSupport.java:371) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:346) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121) ~[hive-exec-2.3.1.jar:1.10.1]
    at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:288) ~[hive-exec-2.3.1.jar:1.10.1]
    at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvroWithMetadata(HoodieParquetWriter.java:91) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.hudi.io.HoodieCreateHandle.write(HoodieCreateHandle.java:101) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:150) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:142) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:125) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
    at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:38) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
```

This is happening because in Hudi, even for bits running through Spark, we are using `Hive 2.3.1`, which is not really compatible with Spark. So `hive-exec 2.3.1` ends up in the `HoodieJavaApp` classpath while running the example, and it has its own shaded parquet version, which is old and conflicts with `parquet 1.10.1`. What I propose here is that we use a version of Hive that is compatible with Spark, at least for the bits running inside Spark, so that compatible versions of Hive end up on the classpath. `hive-exec 1.2.1.spark2` does not cause this issue, as it does not shade parquet. Also, we have removed Hive shading in master now, so we depend on the runtime Hive version anyway, which is Spark's Hive version. So from the code's perspective too, I think it makes sense to depend on Spark's Hive version for the code running inside Spark, to avoid such issues.

2. After that, in `ITTestHoodieSanity`, all the `_rt` tests were failing because our code now uses `Avro 1.8.2` while Hive is still on an older version. We need to shade Avro in `hudi-hadoop-mr-bundle`, which we had done internally for EMR through an optional profile. Now that we are migrating Hudi itself to Avro 1.8.2, we need to always shade Avro to get around this issue. More details at https://issues.apache.org/jira/browse/HUDI-268

3. Finally, some tests were failing because `spark-avro` was not being passed when starting the spark-shell, so it could not find the classes. So I switched over to downloading `spark-avro` instead of `databricks-avro`.

With the above changes, the integration tests work now. Let me know your thoughts on these changes, and whether there are any concerns.
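The `NoSuchMethodError` in item 1 of this comment comes down to which jar supplies `org.apache.parquet.io.api.Binary` at runtime. One way to confirm that kind of conflict is to ask the JVM where a class was loaded from and whether it exposes the expected method. The sketch below is not part of the PR: `ClasspathProbe`, `originOf`, and `hasMethod` are hypothetical names, and it relies only on standard JDK reflection calls. The default class and method names are taken from the stack trace.

```java
import java.security.CodeSource;

public class ClasspathProbe {

    /** Location (jar or directory) a class was loaded from, or a placeholder when unknown. */
    public static String originOf(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // The JVM reports no code source for some classes (typically JDK classes).
            return (src == null || src.getLocation() == null)
                    ? "<bootstrap>"
                    : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "<not on classpath>";
        }
    }

    /** True if the class is loadable and has a public method with exactly these parameter types. */
    public static boolean hasMethod(String className, String method, Class<?>... params) {
        try {
            Class.forName(className).getMethod(method, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass another fully qualified class name as the first argument if needed.
        String cls = args.length > 0 ? args[0] : "org.apache.parquet.io.api.Binary";
        System.out.println(cls + " loaded from " + originOf(cls));
        System.out.println("has fromCharSequence(CharSequence): "
                + hasMethod(cls, "fromCharSequence", CharSequence.class));
    }
}
```

Run with the same classpath as the failing test, this would show whether `Binary` is being served from `hive-exec` or from the parquet jar that the bundle was compiled against.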
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554016570

> Looks like the integration tests are failing with dependency version mismatches.

@bvaradar Yeah, I have been looking into them. As you mentioned, there are multiple dependency-related issues going on. I am working on a solution.
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-552798034

> @umehrot2 : If you want to give it a shot, it's better to open a new PR. You would need to update docker.spark.version in pom files below docker/hadoop/... and also update spark version in Dockerfile in these directories : spark_base, sparkadhoc, sparkworker and sparkmaster.
>
> You can then run docker/build_local_docker_images.sh to build new docker images locally and then run integration tests. We would have to push these docker images so that travis integration can pick it up (we will help on this).
>
> @umehrot2 : If you need help, let us know. Either me or @bhasudha can get the docker images built and pushed.

@bvaradar Thanks for the suggestions. I will give it a shot tomorrow and reach out if I have any questions.
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-552683643

> Change looks good. But we need to get the tests working reliably and think about changes to bundling.. Our docker images are still 2.3x . Should we upgrade and push them as well ? @bvaradar do you want to shepherd this since you are working closely with @modi95 as well?

Yeah, any guidance on this front would be appreciated, in particular on how we want to go about getting the tests working. I will look into the docker setup.
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-552682699

> Hi Udit! Thanks for making this PR!
>
> I've been working on upgrading HUDI to Spark 2.4 internally at Uber! So I'll list out a few things that I had to do, so that you're not trying to re-discover these things yourself :)
>
> 1. Some of the `create` functions in `HoodieWrapperFileSystem` don't fully work with Parquet 1.10+. See [here](https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-common/src/main/java/org/apache/hudi/common/io/storage/HoodieWrapperFileSystem.java#L146). We'll need to make sure that all the `create` functions correctly call `wrapOutputStream`.
> 2. `hive-exec` is a fat JAR. It might be causing unit test failures as it may introduce an older version of avro into the classpath. We're currently trying to figure out how to address this. Let us know if you have any suggestions!
>
> Btw - I also went to UIUC! Great to meet new Illini!

@modi95 good to hear from a fellow Illini! About the points you raised:

1. I am not quite sure yet where this comes into the picture. Did you run into an actual issue? Is there an easy way to reproduce it?
2. Yeah, I need to look into the test failures. I am not sure of the reasons yet. But I guess the first thing I need to do is upgrade the docker images to use spark 2.4.4 and probably hive 2.3.6, right?
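Point 1 of the quoted comment asks that every `create` overload go through `wrapOutputStream`. A common way to guarantee that is to chain the overloads so that only one of them opens the raw stream and wraps it. The sketch below only illustrates that pattern and is not Hudi's code: `WrapperFs`, `CountingStream`, and `rawCreate` are hypothetical, the signatures are simplified, and an in-memory stream stands in for the underlying file system.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class WrappedCreateSketch {

    /** Decorator that counts bytes written; a stand-in for whatever the real wrapper tracks. */
    public static class CountingStream extends FilterOutputStream {
        private long written;

        public CountingStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            written++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            written += len;
        }

        public long written() {
            return written;
        }
    }

    /** Hypothetical wrapper: all overloads funnel into one method that wraps the raw stream. */
    public static class WrapperFs {
        // Stand-in for the delegate file system's create call.
        private OutputStream rawCreate(String path, boolean overwrite, int bufferSize) {
            return new ByteArrayOutputStream();
        }

        private OutputStream wrapOutputStream(String path, OutputStream raw) {
            return new CountingStream(raw);
        }

        public OutputStream create(String path) {
            return create(path, true);
        }

        public OutputStream create(String path, boolean overwrite) {
            return create(path, overwrite, 4096);
        }

        // The only overload that touches the raw stream, so nothing escapes unwrapped.
        public OutputStream create(String path, boolean overwrite, int bufferSize) {
            return wrapOutputStream(path, rawCreate(path, overwrite, bufferSize));
        }
    }

    public static void main(String[] args) throws IOException {
        WrapperFs fs = new WrapperFs();
        try (OutputStream out = fs.create("/tmp/example.parquet", true)) {
            out.write(new byte[] {1, 2, 3});
            System.out.println("wrapped: " + (out instanceof CountingStream)
                    + ", bytes: " + ((CountingStream) out).written());
        }
    }
}
```

With this layout, adding a new `create` overload means delegating to an existing one, which keeps the wrapping in a single place.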
[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-552051388

> Sending this PR out early to get feedback. Have not yet looked into what changes are required for tests. But in general these changes have been working for us on AWS EMR without any issues so far. This PR implements the following:
>
> * Migrates Hudi to Spark 2.4.4
> * Migrates Hudi to use spark-avro module instead of the deprecated databricks-avro
> * Adds support for Decimal/Date types correctly through Hive
> * Timestamp still needs to be supported as it is blocked on https://issues.apache.org/jira/browse/HUDI-83
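For context on the Decimal/Date bullet: the Avro specification defines `date` as an `int` counting days since 1970-01-01, and `decimal` as `bytes` (or `fixed`) holding the big-endian two's-complement unscaled value, with the scale kept in the schema. The sketch below shows only that encoding using JDK classes. It is not the conversion code from this PR, and `LogicalTypeSketch` and its method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.LocalDate;

public class LogicalTypeSketch {

    /** Avro `date`: days since the Unix epoch, stored as an int. */
    public static int dateToDays(LocalDate date) {
        return Math.toIntExact(date.toEpochDay());
    }

    public static LocalDate daysToDate(int days) {
        return LocalDate.ofEpochDay(days);
    }

    /**
     * Avro `decimal`: big-endian two's-complement bytes of the unscaled value.
     * setScale throws ArithmeticException if the value does not fit the schema scale exactly.
     */
    public static byte[] decimalToBytes(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue().toByteArray();
    }

    public static BigDecimal bytesToDecimal(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        int days = dateToDays(LocalDate.of(2016, 3, 15));
        System.out.println(days + " -> " + daysToDate(days));

        byte[] raw = decimalToBytes(new BigDecimal("12.34"), 2);
        System.out.println(raw.length + " bytes -> " + bytesToDecimal(raw, 2));
    }
}
```

A reader that ignores the logical type sees only a plain `int` or raw `bytes`, so the writer and the query engine have to agree on the logical type annotations for these columns to be interpreted correctly.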