[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2020-01-10 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-573235373
 
 
   @bvaradar Created a JIRA to track documentation of Avro shading caveat 
https://issues.apache.org/jira/browse/HUDI-519




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2020-01-08 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-572232472
 
 
   @vinothchandar @bvaradar @n3nash I have updated the PR to now use `hive-exec` 
with the `core` classifier to solve the unit test issues that were occurring 
because of Hive. Removed the usage of `spark.hive.version` as desired. 
Let me know if this looks good.
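
   For reference, the `core` classifier on `hive-exec` selects the thin jar that 
does not bundle its own parquet/avro. A minimal sketch of what that dependency 
looks like in a Maven POM follows; the version property and `test` scope are 
assumptions for illustration, not the exact entry in the PR:

   ```xml
   <!-- Sketch only: pull in hive-exec via its "core" classifier so the test
        classpath keeps the parquet/avro versions Hudi itself declares. -->
   <dependency>
     <groupId>org.apache.hive</groupId>
     <artifactId>hive-exec</artifactId>
     <version>${hive.version}</version>
     <classifier>core</classifier>
     <scope>test</scope>
   </dependency>
   ```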




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2020-01-08 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-572199147
 
 
   Ack, working with that deadline in mind.




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2020-01-07 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-571730182
 
 
   > @umehrot2 are you still driving this? We would like to merge this asap, 
giving us enough time for the next release to be cut..
   > 
   > cc @leesf

   @vinothchandar I just got back from time off a couple of days ago. Let me 
catch up on this PR and try to get it merged soon. When are we targeting the 
next release cut?
   
   




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-18 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-555259474
 
 
   @vinothchandar Yes, updating the quick start page makes sense. Will you guys 
be doing that?




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-18 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-555259043
 
 
   > @umehrot2 : Sorry for the back-and-forth on this. Issue 1 (as mentioned in 
[#1005 
(comment)](https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554089712))
 is due to the fat jar hive-exec. @n3nash proposed a solution in Uber which won't 
require moving to spark-hive. Instead of the test dependency hive-exec, can 
you try depending on the non-fat version of the jar, called hive-exec-core. 
Hopefully, we can control parquet/avro versions getting loaded for the tests.
   
   @bvaradar That's fine, we should take the time and solve this the right way.
   
   In `hudi-utilities` it is a test dependency, but not in `hudi-spark`. 
`hudi-spark` depends on `hive-service`, which brings in `hive-exec` as a 
transitive dependency, and that is why it ends up on the test classpath. We 
would have to get rid of `hive-service` if we were to do that.
   
   Also, I don't see any artifact named `hive-exec-core`; Maven does not 
recognize it. But beyond that, when we depend on Spark's Hive version at 
runtime, is there any reason we want to build against Hive 2.x instead? Maybe 
`hudi-hive`, since it is an independent module, makes sense to build with Hive 
2.x. But for anything running within Spark, why are we inclined to build 
against a version of Hive not supported by Spark?
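
   As context for the transitive path discussed above: one standard Maven way 
to keep the fat `hive-exec` off `hudi-spark`'s test classpath without dropping 
`hive-service` entirely is an exclusion on that dependency. This is only a 
sketch of the mechanics under an assumed `hive.version` property, not what the 
PR ultimately does:

   ```xml
   <!-- Sketch only: exclude the fat hive-exec that hive-service drags in
        transitively; hive-exec can then be re-added explicitly (for example
        with the thin "core" classifier) where it is actually needed. -->
   <dependency>
     <groupId>org.apache.hive</groupId>
     <artifactId>hive-service</artifactId>
     <version>${hive.version}</version>
     <exclusions>
       <exclusion>
         <groupId>org.apache.hive</groupId>
         <artifactId>hive-exec</artifactId>
       </exclusion>
     </exclusions>
   </dependency>
   ```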




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-14 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554169864
 
 
   > > @bvaradar can we re-trigger the tests ? I think this time it failed due 
to flaky timeouts
   > 
   > @umehrot2 : I re-triggered anyways but looks like the log length reached 
max limits
   
   @bvaradar you are right, it failed again because of the log length. Any 
suggestions?




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-14 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554148996
 
 
   @bvaradar can we re-trigger the tests? I think this time it failed due to 
flaky timeouts.




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-14 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554089712
 
 
   @modi95 @bvaradar 
   
   I was able to fix the integration test dependency issues, at least on my 
local setup. Hoping that things run fine on Travis too. To give an overview, 
there were 3 major failures happening:
   
   1. The `ITTestHoodieSanity` tests were failing, firstly, because of this error:
   ```
   17:15:31.995 [pool-21-thread-2] ERROR org.apache.hudi.io.HoodieCreateHandle - Error writing record HoodieRecord{key=HoodieKey { recordKey=98ea14b7-b318-4b0b-9f14-0115900a10e0 partitionPath=2016/03/15}, currentLocation='null', newLocation='null'}

   java.lang.NoSuchMethodError: org.apache.parquet.io.api.Binary.fromCharSequence(Ljava/lang/CharSequence;)Lorg/apache/parquet/io/api/Binary;
       at org.apache.parquet.avro.AvroWriteSupport.fromAvroString(AvroWriteSupport.java:371) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:346) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121) ~[hive-exec-2.3.1.jar:1.10.1]
       at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:288) ~[hive-exec-2.3.1.jar:1.10.1]
       at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvroWithMetadata(HoodieParquetWriter.java:91) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.hudi.io.HoodieCreateHandle.write(HoodieCreateHandle.java:101) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:150) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:142) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:125) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
       at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:38) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
   ```
   
   This is happening because in Hudi, even for the bits running through Spark, 
we are using `Hive 2.3.1`, which is not really compatible with Spark. So 
`hive-exec 2.3.1` ends up on the `HoodieJavaApp` classpath while running the 
example, and it has its own shaded parquet version, which is old and conflicts 
with `parquet 1.10.1`.
   
   What I propose here is that we use a version of Hive that is compatible 
with Spark, at least for the bits running inside Spark, so that compatible 
versions of Hive end up on the classpath. `hive-exec 1.2.1.spark2` does not 
cause this issue as it does not shade parquet. Also, we have removed Hive 
shading on master now, so we anyway depend on the runtime Hive version, which 
is Spark's Hive version. From the code's perspective it therefore also makes 
sense to depend on Spark's Hive version for the code that runs inside Spark, 
to avoid such issues (a coordinate sketch for this follows after this list).
   
   2. After that, all the `_rt` tests in `ITTestHoodieSanity` were failing 
because our code now uses `Avro 1.8.2` while Hive is still on older versions, 
so we need to shade Avro in `hudi-hadoop-mr-bundle`, which we had done 
internally for EMR through an optional profile. Now that we are migrating Hudi 
itself to Avro 1.8.2, we need to always shade Avro to get around this issue 
(a relocation sketch follows after this list). More details on 
https://issues.apache.org/jira/browse/HUDI-268
   
   3. Finally, some tests were failing because `spark-avro` was not being 
passed while starting the spark-shell, so it was not finding the classes. I 
switched over to downloading `spark-avro` instead of `databricks-avro` 
(a coordinate swap is sketched after this list).
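
   To make these concrete, here are three small sketches; they are illustrations 
under stated assumptions, not the exact pom changes in this PR.
   
   For point 1, depending on Spark's own Hive fork instead of Hive 2.3.1 for the 
Spark-facing modules amounts to a coordinate change along these lines (the 
`org.spark-project.hive` fork is what Spark 2.x itself builds against; the 
exact wiring here is an assumption):

   ```xml
   <!-- Sketch only: build Spark-facing modules against the Hive fork that Spark
        ships, so no conflicting shaded parquet lands on the classpath. -->
   <dependency>
     <groupId>org.spark-project.hive</groupId>
     <artifactId>hive-exec</artifactId>
     <version>1.2.1.spark2</version>
   </dependency>
   ```

   For point 2, shading Avro in `hudi-hadoop-mr-bundle` is a `maven-shade-plugin` 
relocation of roughly this shape (the shaded package prefix is illustrative, 
not the bundle's actual configuration):

   ```xml
   <!-- Sketch only: relocate Avro classes inside the bundle so Hive's older Avro
        on the cluster does not clash with the Avro 1.8.2 Hudi now compiles against. -->
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <configuration>
       <relocations>
         <relocation>
           <pattern>org.apache.avro</pattern>
           <shadedPattern>org.apache.hudi.org.apache.avro</shadedPattern>
         </relocation>
       </relocations>
     </configuration>
   </plugin>
   ```

   For point 3, the move off the deprecated Databricks library is essentially a 
swap of dependency coordinates (the Scala suffix and versions below are 
assumptions matching what this PR targets):

   ```xml
   <!-- Before: the deprecated external package -->
   <dependency>
     <groupId>com.databricks</groupId>
     <artifactId>spark-avro_2.11</artifactId>
     <version>4.0.0</version>
   </dependency>

   <!-- After: the spark-avro module published with Spark 2.4.x -->
   <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-avro_2.11</artifactId>
     <version>2.4.4</version>
   </dependency>
   ```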
   
   By making the above changes, the integration tests now pass. Let me know 
your thoughts on these changes, and whether there are any concerns.



[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-14 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-554016570
 
 
   > Looks like the integration tests are failing with dependency version 
mismatches.
   
   @bvaradar yeah, I have been looking into them. Like you mentioned, there 
are multiple dependency-related issues going on. Working on a solution.




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-12 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-552798034
 
 
   > @umehrot2 : If you want to give it a shot, it's better to open a new PR. 
You would need to update docker.spark.version in the pom files below 
docker/hadoop/... and also update the spark version in the Dockerfile in these 
directories: spark_base, sparkadhoc, sparkworker and sparkmaster.
   > 
   > You can then run docker/build_local_docker_images.sh to build new docker 
images locally and then run integration tests. We would have to push these 
docker images so that travis integration can pick it up (we will help on this).
   > 
   > @umehrot2 : If you need help, let us know. Either me or @bhasudha can get 
the docker images built and pushed.
   
   @bvaradar thanks for the suggestions. Will give it a shot tomorrow, and 
reach out in case of any doubts.




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-11 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-552683643
 
 
   > Change looks good. But we need to get the tests working reliably and think 
about changes to bundling. Our docker images are still on 2.3.x. Should we 
upgrade and push them as well? @bvaradar do you want to shepherd this since 
you are working closely with @modi95 as well?
   
   Yeah, any guidance on this front on how we want to go about getting the 
tests working would be appreciated. I will look into the docker setup.




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-11 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-552682699
 
 
   > Hi Udit! Thanks for making this PR!
   > 
   > I've been working on upgrading HUDI to Spark 2.4 internally at Uber! So 
I'll list out a few things that I had to do, so that you're not trying to 
re-discover these things yourself :)
   > 
   > 1. Some of the `create` functions in `HoodieWrapperFileSystem` don't 
fully work with Parquet 1.10+. See 
[here](https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-common/src/main/java/org/apache/hudi/common/io/storage/HoodieWrapperFileSystem.java#L146).
 We'll need to make sure that all the `create` functions correctly call 
`wrapOutputStream`.
   > 
   > 2. `hive-exec` is a fat JAR. It might be causing unit test failures as 
it may introduce an older version of avro into the classpath. We're currently 
trying to figure out how to address this. Let us know if you have any 
suggestions!
   > 
   > 
   > Btw - I also went to UIUC! Great to meet new Illini!
   
   @modi95 good to hear from a fellow Illini!
   
   About the points you raised:
   
   1. I am not quite sure yet where this comes into the picture. Did you run 
into an actual issue? Is there an easy way to reproduce it?
   
   2. Yeah, I need to look into the test failures; not sure of the reasons yet. 
But the first thing I need to do is upgrade the docker images, right, to use 
Spark 2.4.4 and probably Hive 2.3.6?




[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2019-11-08 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-552051388
 
 
   > Sending this PR out early to get feedback. Have not yet looked into what 
changes are required for tests. But in general these changes have been working 
for us on AWS EMR without any issues so far. This PR implements the following:
   > 
   > * Migrates Hudi to Spark 2.4.4
   > 
   > * Migrates Hudi to use spark-avro module instead of the deprecated 
databricks-avro
   > 
   > * Adds support for Decimal/Date types correctly through Hive
   > 
   > * Timestamp still needs to be supported as it is blocked on 
https://issues.apache.org/jira/browse/HUDI-83
   
   

