devanshguptatrepp opened a new issue, #7261:
URL: https://github.com/apache/hudi/issues/7261
**Describe the problem you faced**
When trying to write data to Hudi, my Spark application fails with the following error:

```
java.lang.Exception: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
    at com.trepp.zone.ZoneExecutionHelper.upsert(ZoneExecutionHelper.scala:101)
    at com.trepp.zone.Presentation.$anonfun$writeHudiObject$1(Presentation.scala:92)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at com.trepp.zone.Presentation.writeHudiObject(Presentation.scala:81)
```
I am reading data from an Amazon S3 bucket and doing some transformations before writing the data to Hudi. I am using the hudi-spark3-bundle_2.12 jar available on Maven Central for Scala 2.12.
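For reference, a minimal sketch of how the write is wired up on my side (not my actual code; the table name, key fields, and path below are placeholders). Hudi datasource options such as `hoodie.datasource.hive_sync.enable` are read from `.option(...)` calls on the DataFrame writer, which is worth noting because a key like `hoodie.meta.sync.enable` passed as a `--conf` to spark-submit is treated as a Spark configuration and may never reach the Hudi writer:

```scala
// Hypothetical sketch: table name, record key, precombine field, and
// target path are placeholders, not the reporter's real values.
df.write
  .format("hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Writer-level flags are what the Hudi datasource actually consults
  // when deciding whether to run HiveSyncTool after a commit.
  .option("hoodie.datasource.hive_sync.enable", "false")
  .option("hoodie.datasource.meta.sync.enable", "false")
  .mode("append")
  .save("s3://bucketName/hudi/example_table")
```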
**Expected behavior**
The application successfully writes to the Hudi table location.
**Environment Description**
* Scala version : 2.12.17
* Hudi version : 0.12.0
* Spark version : 3.3.0
* Hadoop version : 2.8.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
* Running on AWS EMR version : 6.8.0
**Additional context**
Spark-submit commands tried:

1. With `hudi-spark3-bundle_2.12-0.12.0.jar`, since Hudi 0.12.x supports Spark 3.3.x as per [the quick start guide](https://hudi.apache.org/docs/quick-start-guide/#setup):

   ```shell
   spark-submit --deploy-mode client \
     --jars s3a://bucketName/binaries/hudi/etl/hudi-spark3-bundle_2.12-0.12.0.jar \
     --class com.xyz.EntryClass --master yarn \
     --conf spark.files.maxPartitionBytes=268435456 \
     --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
     --conf spark.dynamicAllocation.enabled=true \
     --conf hoodie.meta.sync.enable=false \
     s3://bucketName/application-1.0-SNAPSHOT-jar-with-dependencies.jar
   ```

2. With `hudi-spark-bundle_2.12-0.11.1.jar`, since AWS EMR 6.8.0 ships Hudi 0.11.1 as per [the EMR 6.8.0 release notes](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-680-release.html):

   ```shell
   spark-submit --deploy-mode client \
     --jars s3a://bucketName/binaries/hudi/etl/hudi-spark-bundle_2.12-0.11.1.jar \
     --class com.xyz.EntryClass --master yarn \
     --conf spark.files.maxPartitionBytes=268435456 \
     --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
     --conf spark.dynamicAllocation.enabled=true \
     --conf hoodie.meta.sync.enable=false \
     s3://bucketName/application-1.0-SNAPSHOT-jar-with-dependencies.jar
   ```
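A general point about these two variants: `java.io.InvalidClassException: ... local class incompatible` typically means the driver and executors are loading different builds of the same class, which can happen when more than one Hudi bundle is on the classpath (one via `--jars`, one shaded into the fat jar, and possibly the Hudi jars preinstalled on EMR). A hedged sketch of a submit that relies on a single bundle (bucket, class name, and version are placeholders taken from the commands above, not a verified fix):

```shell
# Sketch only: keep exactly one Hudi bundle on the classpath, and make it
# the same version the application was compiled against (0.12.0 assumed here).
spark-submit --deploy-mode client --master yarn \
  --jars s3a://bucketName/binaries/hudi/etl/hudi-spark3-bundle_2.12-0.12.0.jar \
  --class com.xyz.EntryClass \
  s3://bucketName/application-1.0-SNAPSHOT-jar-with-dependencies.jar
```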
In my pom.xml, I have the following dependency for the Hudi Spark bundle:

```xml
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-spark3-bundle_2.12</artifactId>
  <version>0.11.1</version>
</dependency>
```
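Note that this bundled version (0.11.1) differs from the 0.12.0 jar passed via `--jars` in command 1, so the fat jar and the `--jars` bundle can each carry their own copy of classes such as `org.apache.hudi.data.HoodieJavaRDD`. A hedged sketch of declaring the bundle with `provided` scope so the fat jar does not ship its own copy (the version shown is an assumption; it should match whatever single bundle is actually on the cluster classpath):

```xml
<!-- Sketch, not a verified fix: align the version with the bundle supplied
     at runtime and mark it provided so the shade/assembly plugin excludes it -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-spark3-bundle_2.12</artifactId>
  <version>0.12.0</version>
  <scope>provided</scope>
</dependency>
```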
**Stacktrace**
1. Error with spark-submit command 1:

   ```
   java.lang.Exception: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool
       at com.trepp.zone.ZoneExecutionHelper.upsert(ZoneExecutionHelper.scala:101)
       at com.trepp.zone.Presentation.$anonfun$writeHudiObject$1(Presentation.scala:92)
       at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
       at scala.util.Try$.apply(Try.scala:213)
       at com.trepp.zone.Presentation.writeHudiObject(Presentation.scala:81)
       at com.trepp.process.Executor.$anonfun$writeObject$2(Executor.scala:136)
       at com.trepp.process.Executor.$anonfun$writeObject$2$adapted(Executor.scala:133)
       at scala.collection.immutable.List.foreach(List.scala:431)
       at com.trepp.process.Executor.$anonfun$writeObject$1(Executor.scala:133)
       at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
       at scala.util.Try$.apply(Try.scala:213)
       at com.trepp.process.Executor.writeObject(Executor.scala:133)
       at com.trepp.process.Executor$$anon$2.$anonfun$accept$2(Executor.scala:118)
       at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
       at scala.util.Try$.apply(Try.scala:213)
       at com.trepp.process.Executor$$anon$2.accept(Executor.scala:113)
       at com.trepp.process.Executor$$anon$2.accept(Executor.scala:111)
       at java.util.TreeMap.forEach(TreeMap.java:1005)
       at com.trepp.process.Executor.executeQuery(Executor.scala:111)
       at com.trepp.dataload.EtlImpl.$anonfun$executeProcess$3(EtlImpl.scala:43)
       at scala.util.Try$.apply(Try.scala:213)
       at com.trepp.dataload.EtlImpl.$anonfun$executeProcess$1(EtlImpl.scala:37)
       at com.trepp.dataload.EtlImpl.$anonfun$executeProcess$1$adapted(EtlImpl.scala:23)
       at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
       at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
       at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
       at com.trepp.dataload.EtlImpl.executeProcess(EtlImpl.scala:23)
       at com.trepp.TreppClient$.$anonfun$main$1(TreppClient.scala:46)
       at scala.util.Try$.apply(Try.scala:213)
       at com.trepp.TreppClient$.main(TreppClient.scala:40)
       at com.trepp.TreppClient.main(TreppClient.scala)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
       at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1006)
       at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
       at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
       at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
       at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1095)
       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1104)
       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   ```
2. Error with spark-submit command 2:

   ```
   2022-11-21T06:59:16,079 ERROR scheduler.TaskSetManager: Task 0 in stage 22.0 failed 4 times; aborting job
   java.lang.Exception: Job aborted due to stage failure: Task 0 in stage 22.0 failed 4 times, most recent failure: Lost task 0.3 in stage 22.0 (TID 297) (ip-10-73-96-169.ec2.internal executor 2): java.io.InvalidClassException: org.apache.hudi.data.HoodieJavaRDD; local class incompatible: stream classdesc serialVersionUID = -7482894453477434543, local class serialVersionUID = -9174748892468899293
       at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
       at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
       at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
       at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1815)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1640)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
       at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:527)
       at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
       at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:527)
       at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2322)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
       at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
       at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
       at org.apache.spark.scheduler.Task.run(Task.scala:138)
       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:750)
   ```