MathurCodes1 opened a new issue, #7392: URL: https://github.com/apache/hudi/issues/7392
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

I'm loading a partitioned MOR table with dummy data and then reading it back with Spark inside the same Glue job. Reading the MOR `_rt` table, either via `spark.sql("select * from city_rt")` or via a snapshot query:

```
val tableDF = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("s3://input-hudi-poc/mor_output/city/" + "/*/*")
```

fails with:

```
Exception in User Class: java.lang.NoSuchMethodError : org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
```

**To Reproduce**

Steps to reproduce the behavior: I'm using Glue version 3 to run the jobs.
JARs used:

- hudi-spark3-bundle_2.12-0.10.0.jar
- calcite-core-1.16.0.jar
- libfb303-0.9.3.jar

Script and Hudi config:

```
val morDF = Seq(
  (1, "Jaipur", "India", 1990, 3),
  (2, "Birmingham", "England", 1990, 4)
).toDF("id", "city", "country", "year", "month")

val tableName = "city"
val dbName = "default"
val tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL
val tablePrecombine = "id"
val tableRecordKey = "id"
val partitionKey = "year,month"
val sparkSaveMode = SaveMode.Overwrite
//val sparkSaveMode = SaveMode.Append
val outputDir = "s3://input-hudi-poc/mor_output/city/"
//val writeOperation = DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL
val writeOperation = DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL
val writeOperationUpsert = DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL

//"org.apache.hudi.keygen.NonpartitionedKeyGenerator"
//"org.apache.hudi.hive.NonPartitionedExtractor"
//org.apache.hudi.keygen.ComplexKeyGenerator
//org.apache.hudi.hive.MultiPartKeysValueExtractor

val hudiCommonOptions: Map[String, String] = Map(
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.precombine.field" -> tablePrecombine,
  "hoodie.datasource.write.recordkey.field" -> tableRecordKey,
  "hoodie.datasource.write.row.writer.enable" -> "true",
  "hoodie.datasource.write.reconcile.schema" -> "false",
  "hoodie.datasource.write.partitionpath.field" -> partitionKey,
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.table" -> tableName,
  "hoodie.datasource.hive_sync.database" -> dbName,
  "hoodie.datasource.hive_sync.partition_fields" -> partitionKey,
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.combine.before.upsert" -> "true",
  "hoodie.index.type" -> "BLOOM",
  "spark.hadoop.parquet.avro.write-old-list-structure" -> "false"
)

val hudiAdvancedOptions: Map[String, String] = Map(
  DataSourceWriteOptions.TABLE_TYPE.key() -> tableType,
  "hoodie.compact.inline" -> "false",
  "hoodie.compact.schedule.inline" -> "true",
  "hoodie.compact.inline.trigger.strategy" -> "NUM_COMMITS",
  "hoodie.compact.inline.max.delta.commits" -> "2",
  "hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
  "hoodie.cleaner.commits.retained" -> "3"
)

morDF.write.format("org.apache.hudi")
  .options(hudiCommonOptions)
  .options(hudiAdvancedOptions)
  .option("hoodie.datasource.write.operation", writeOperation)
  .mode(sparkSaveMode)
  .save(outputDir)

val tableDF = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("s3://input-hudi-poc/mor_output/city/" + "/*/*")
print(tableDF.show(false))
```

**Expected behavior**

The snapshot query, or the `spark.sql()` query against the `city_rt` table, should return the `_rt` table's data. By querying the table I wanted to find the latest commit time and the rows inserted at that commit time.

**Environment Description**

* Hudi version : 0.10.0
* Spark version : 3.1
* Hive version : hive-exec-2.3.7-amzn-4-core.jar
* Hadoop version : hadoop-mapreduce-client-common-3.2.1-amzn-3.jar
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**
**Stacktrace**

```
2022-12-06 21:06:32,671 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
    at org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildFileIndex$3(MergeOnReadSnapshotRelation.scala:174)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
    at scala.collection.Iterator.foreach(Iterator.scala:937)
    at scala.collection.Iterator.foreach$(Iterator.scala:937)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
    at scala.collection.IterableLike.foreach(IterableLike.scala:70)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:69)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike.map(TraversableLike.scala:233)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:169)
    at org.apache.hudi.MergeOnReadSnapshotRelation.buildScan(MergeOnReadSnapshotRelation.scala:110)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:356)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:389)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:466)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:388)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:356)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:480)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:485)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:67)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
    at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:156)
    at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:156)
    at scala.collection.Iterator.foreach(Iterator.scala:937)
    at scala.collection.Iterator.foreach$(Iterator.scala:937)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
    at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:156)
    at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:154)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1425)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$2(QueryPlanner.scala:75)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:480)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:486)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
    at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:67)
    at org.apache.spark.sql.execution.QueryExecution$.createSparkPlan(QueryExecution.scala:426)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$sparkPlan$1(QueryExecution.scala:104)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:163)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:163)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:104)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:97)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:117)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:163)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:163)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:110)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$5(QueryExecution.scala:245)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:487)
    at org.apache.spark.sql.execution.QueryExecution.writePlans(QueryExecution.scala:245)
    at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:260)
    at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:216)
    at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:195)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:102)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3722)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2762)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2969)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:302)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:339)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:867)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:844)
    at GlueApp$.main(DataLoaderMain.scala:102)
    at GlueApp.main(DataLoaderMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:48)
    at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:48)
    at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:78)
    at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:143)
    at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:30)
    at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
