Shubham21k opened a new issue, #9919: URL: https://github.com/apache/hudi/issues/9919
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? YES
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

We are trying to upgrade from Hudi 0.11.1 to Hudi 0.13.1 for batch workloads. We use EMR to run these batch HoodieDeltaStreamer / HoodieIncrSource jobs. We are facing dependency issues on EMR releases (e.g. emr-6.7.0) that ship Spark 3.2.x or 3.3.x.

**To Reproduce**

Steps to reproduce the behavior:

1. Run the DeltaStreamer job on emr-6.5.0 (Spark 3.1.2); it runs fine.
2. Run the same DeltaStreamer job on emr-6.11.1 (Spark 3.3.2) or emr-6.7.0 (Spark 3.2.1); the job fails (see error logs attached below).

**Expected behavior**

Since Hudi 0.13.1 is compatible with Spark 3.2.x and 3.3.x, Hudi jobs should run fine on EMR releases shipping those Spark versions.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.2.1 or 3.3.2
* Hive version : not used
* Hadoop version : 3.3.3-amzn-3.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : NO

**Additional context**

Spark step:

```
spark-submit --master yarn \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,s3://generic-data-lake/jars/hudi-utilities-bundle_2.12-0.13.1.jar,s3://generic-data-lake/jars/hudi-aws-bundle-0.13.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.cores=5 \
  --conf spark.driver.memory=3200m \
  --conf spark.driver.memoryOverhead=800m \
  --conf spark.executor.memoryOverhead=1400m \
  --conf spark.executor.memory=14600m \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=1 \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=21 \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.task.maxFailures=5 \
  --conf spark.rdd.compress=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.yarn.max.executor.failures=5 \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.sql.catalogImplementation=hive \
  --deploy-mode cluster \
  s3://generic-data-lake/jars/deltastreamer-addons-2.0-SNAPSHOT.jar \
  --hoodie-conf hoodie.parquet.compression.codec=snappy \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.num_instants=100 \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.HoodieIncrSource \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://generic-data-lake/raw-data/credit_underwriting/features \
  --hoodie-conf hoodie.metrics.on=true \
  --hoodie-conf hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY \
  --hoodie-conf hoodie.metrics.pushgateway.host=pushgateway.prod.generic-tech.in \
  --hoodie-conf hoodie.metrics.pushgateway.port=443 \
  --hoodie-conf hoodie.metrics.pushgateway.delete.on.shutdown=false \
  --hoodie-conf hoodie.metrics.pushgateway.job.name=credit_underwriting_transformed_features_accounts_hudi \
  --hoodie-conf hoodie.metrics.pushgateway.random.job.name.suffix=false \
  --hoodie-conf hoodie.metadata.enable=true \
  --hoodie-conf hoodie.metrics.reporter.metricsname.prefix=hudi \
  --target-base-path s3://generic-data-lake/raw-data/credit_underwriting_transformed/features_accounts \
  --target-table features_accounts \
  --enable-sync \
  --hoodie-conf hoodie.datasource.hive_sync.database=credit_underwriting_transformed \
  --hoodie-conf hoodie.datasource.hive_sync.table=features_accounts \
  --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id,pos \
  --hoodie-conf hoodie.datasource.write.precombine.field=id \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=created_at_dt \
  --hoodie-conf hoodie.datasource.hive_sync.partition_fields=created_at_dt \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
  --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z',yyyy-MM-dd' 'HH:mm:ss.SSSSSS,yyyy-MM-dd' 'HH:mm:ss,yyyy-MM-dd'T'HH:mm:ss'Z'" \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
  --source-ordering-field id \
  --hoodie-conf secret.key.name=credit-underwriting-secret \
  --hoodie-conf transformer.decrypt.cols=features_json \
  --hoodie-conf transformer.uncompress.cols=false \
  --hoodie-conf transformer.jsonToStruct.column=features_json \
  --hoodie-conf transformer.normalize.column=features_json.accounts \
  --hoodie-conf transformer.copy.fields=created_at,created_at_dt \
  --transformer-class com.generic.transform.DecryptTransformer,com.generic.transform.JsonToStructTypeTransformer,com.generic.transform.NormalizeArrayTransformer,com.generic.transform.FlatteningTransformer,com.generic.transform.CopyFieldTransformer
```

**Stacktrace**
```
23/10/17 12:13:41 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieException: Unable to load class
	at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:58)
	at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:68)
	at org.apache.spark.sql.hudi.analysis.HoodieAnalysis$.instantiateKlass(HoodieAnalysis.scala:141)
	at org.apache.spark.sql.hudi.analysis.HoodieAnalysis$.customOptimizerRules(HoodieAnalysis.scala:118)
	at org.apache.spark.sql.hudi.HoodieSparkSessionExtension.apply(HoodieSparkSessionExtension.scala:43)
	at org.apache.spark.sql.hudi.HoodieSparkSessionExtension.apply(HoodieSparkSessionExtension.scala:28)
	at org.apache.spark.sql.SparkSession$.$anonfun$applyExtensions$1(SparkSession.scala:1197)
	at org.apache.spark.sql.SparkSession$.$anonfun$applyExtensions$1$adapted(SparkSession.scala:1192)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$applyExtensions(SparkSession.scala:1192)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:956)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:655)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:152)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:125)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:592)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:740)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.Spark32NestedSchemaPruning
	at java.lang.ClassLoader.findClass(ClassLoader.java:523)
	at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
	at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:55)
	... 21 more
```
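A quick way to narrow this down is to check whether the bundle jar actually ships the class the `ClassNotFoundException` names; if it does not, the bundle was built against a different Spark minor version than the one on the cluster. Below is a minimal, hedged sketch of such a check (a jar is just a zip archive); the demo jar built here is a stand-in so the snippet runs anywhere, and on the cluster you would point `jar_path` at the real `hudi-utilities-bundle_2.12-0.13.1.jar` instead:

```python
import zipfile

def jar_contains_class(jar_path: str, fqcn: str) -> bool:
    """True if the jar has an entry for the fully-qualified class name."""
    entry = fqcn.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# Build a tiny stand-in "jar" so the sketch is self-contained; on the
# cluster, skip this and pass the real bundle path from s3:// or local disk.
demo_jar = "demo-bundle.jar"
with zipfile.ZipFile(demo_jar, "w") as jar:
    jar.writestr("org/apache/hudi/common/util/ReflectionUtils.class", b"")

missing = "org.apache.spark.sql.execution.datasources.Spark32NestedSchemaPruning"
print(jar_contains_class(demo_jar, "org.apache.hudi.common.util.ReflectionUtils"))  # True
print(jar_contains_class(demo_jar, missing))  # False: absent from this demo jar
```

The same check can be done with `unzip -l <bundle>.jar | grep Spark32NestedSchemaPruning` on the EMR master node.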
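Since the failure occurs inside `HoodieSparkSessionExtension` with `ChildFirstURLClassLoader` on the stack, one avenue worth trying (an unverified sketch, not a confirmed fix) is to pair the Spark-version-matched Hudi bundle with the slim utilities bundle, and to drop the `userClassPathFirst` flags so Hudi's Spark-specific classes resolve through a single classloader. The jar names and locations below are illustrative, not tested against this job:

```shell
spark-submit --master yarn --deploy-mode cluster \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,s3://generic-data-lake/jars/hudi-utilities-slim-bundle_2.12-0.13.1.jar,s3://generic-data-lake/jars/hudi-spark3.3-bundle_2.12-0.13.1.jar,s3://generic-data-lake/jars/hudi-aws-bundle-0.13.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3://generic-data-lake/jars/deltastreamer-addons-2.0-SNAPSHOT.jar \
  # ...remaining --conf / --hoodie-conf options unchanged from the step above,
  # minus spark.driver.userClassPathFirst and spark.executor.userClassPathFirst
```

For a Spark 3.2.1 cluster (emr-6.7.0), the `hudi-spark3.2-bundle_2.12` artifact would be the matching choice instead.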
