alberttwong opened a new issue, #12963:
URL: https://github.com/apache/hudi/issues/12963
**Describe the problem you faced**

Following the quickstart at https://hudi.apache.org/docs/quick-start-guide/ on EMR 6.15, the very first write to the table fails with `HoodieException: Failed to instantiate Metadata table`.

**To Reproduce**

Steps to reproduce the behavior:

```
[hadoop@ip-10-0-125-28 ~]$ export SPARK_VERSION=3.4
[hadoop@ip-10-0-125-28 ~]$ rm hudi-spark3.4-bundle_2.12-1.0.1.jar
[hadoop@ip-10-0-125-28 ~]$ ls
[hadoop@ip-10-0-125-28 ~]$ export PYSPARK_PYTHON=$(which python3)
[hadoop@ip-10-0-125-28 ~]$ pyspark --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:1.0.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
Python 3.7.16 (default, Feb  8 2025, 00:19:05)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-17)] on linux
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
org.apache.hudi#hudi-spark3.4-bundle_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-daca4bcd-853b-4f9b-8233-340110eac111;1.0
	confs: [default]
	found org.apache.hudi#hudi-spark3.4-bundle_2.12;1.0.1 in central
	found org.apache.hive#hive-storage-api;2.8.1 in central
	found org.slf4j#slf4j-api;1.7.36 in central
downloading https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/1.0.1/hudi-spark3.4-bundle_2.12-1.0.1.jar ...
	[SUCCESSFUL ] org.apache.hudi#hudi-spark3.4-bundle_2.12;1.0.1!hudi-spark3.4-bundle_2.12.jar (1294ms)
downloading https://repo1.maven.org/maven2/org/apache/hive/hive-storage-api/2.8.1/hive-storage-api-2.8.1.jar ...
	[SUCCESSFUL ] org.apache.hive#hive-storage-api;2.8.1!hive-storage-api.jar (29ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.36/slf4j-api-1.7.36.jar ...
	[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.36!slf4j-api.jar (21ms)
:: resolution report :: resolve 1080ms :: artifacts dl 1347ms
	:: modules in use:
	org.apache.hive#hive-storage-api;2.8.1 from central in [default]
	org.apache.hudi#hudi-spark3.4-bundle_2.12;1.0.1 from central in [default]
	org.slf4j#slf4j-api;1.7.36 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-daca4bcd-853b-4f9b-8233-340110eac111
	confs: [default]
	3 artifacts copied, 0 already retrieved (108331kB/65ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/12 18:34:17 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
25/03/12 18:34:19 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
25/03/12 18:34:26 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hudi_hudi-spark3.4-bundle_2.12-1.0.1.jar added multiple times to distributed cache.
25/03/12 18:34:26 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hive_hive-storage-api-2.8.1.jar added multiple times to distributed cache.
25/03/12 18:34:26 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.36.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1-amzn-2
      /_/

Using Python version 3.7.16 (default, Feb  8 2025 00:19:05)
Spark context Web UI available at http://ip-10-0-125-28.us-west-2.compute.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1741804316774_0001).
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import lit, col
>>>
>>> tableName = "trips_table"
>>> basePath = "file:///tmp/trips_table"
>>> columns = ["ts","uuid","rider","driver","fare","city"]
>>> data =[(1695159649087,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
...     (1695091554788,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
...     (1695046462179,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
...     (1695516137016,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"),
...     (1695115999911,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai")]
>>> inserts = spark.createDataFrame(data).toDF(*columns)
>>>
>>> hudi_options = {
...     'hoodie.table.name': tableName,
...     'hoodie.datasource.write.partitionpath.field': 'city'
... }
>>>
>>> inserts.write.format("hudi"). \
...     options(**hudi_options). \
...     mode("overwrite"). \
...     save(basePath)
25/03/12 18:35:00 WARN HoodieSparkSqlWriterInternal: Choosing BULK_INSERT as the operation type since auto record key generation is applicable
25/03/12 18:35:00 INFO HiveConf: Found configuration file file:/etc/spark/conf.dist/hive-site.xml
25/03/12 18:35:00 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
25/03/12 18:35:00 INFO metastore: Trying to connect to metastore with URI thrift://ip-10-0-125-28.us-west-2.compute.internal:9083
25/03/12 18:35:00 INFO metastore: Opened a connection to metastore, current connections: 1
25/03/12 18:35:00 INFO metastore: Connected to metastore.
25/03/12 18:35:04 WARN HoodieTableFileSystemView: Partition: files is not available in store
25/03/12 18:35:04 WARN HoodieTableFileSystemView: Partition: files is not available in store
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 1398, in save
    self._jwrite.save(path)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in __call__
  File "/usr/lib/spark/python/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o124.save.
: org.apache.hudi.exception.HoodieException: Failed to instantiate Metadata table
	at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:309)
	at org.apache.hudi.client.SparkRDDWriteClient.initMetadataTable(SparkRDDWriteClient.java:271)
	at org.apache.hudi.client.BaseHoodieWriteClient.lambda$doInitTable$7(BaseHoodieWriteClient.java:1305)
	at org.apache.hudi.client.BaseHoodieWriteClient.executeUsingTxnManager(BaseHoodieWriteClient.java:1312)
	at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1302)
	at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1352)
	at org.apache.hudi.commit.BaseDatasetBulkInsertCommitActionExecutor.execute(BaseDatasetBulkInsertCommitActionExecutor.java:100)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.bulkInsertAsRow(HoodieSparkSqlWriter.scala:832)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:494)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$write$1(HoodieSparkSqlWriter.scala:192)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
	at org.apache.spark.sql.adapter.BaseSpark3Adapter.sqlExecutionWithNewExecutionId(BaseSpark3Adapter.scala:105)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:214)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:129)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:170)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT partition files should be > 0
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1442)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.commitInternal(HoodieBackedTableMetadataWriter.java:1349)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.bulkCommit(SparkHoodieBackedTableMetadataWriter.java:149)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:489)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:280)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.<init>(HoodieBackedTableMetadataWriter.java:189)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.<init>(SparkHoodieBackedTableMetadataWriter.java:114)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:91)
	at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:303)
	... 71 more
```

**Expected behavior**

The quickstart bulk insert completes and the Hudi table is written to `file:///tmp/trips_table`.

**Environment Description**

* Hudi version : 1.0.1
* Spark version : 3.4
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) : no
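**Additional context**

The `Caused by` shows the write dying while the writer initializes the metadata table (MDT): `FileGroup count for MDT partition files should be > 0`, raised inside `initializeMetadataTable()` before any data is written. As a diagnostic step (not a fix), the same write can be retried with the metadata table disabled to check whether the rest of the quickstart path works on this cluster. A minimal sketch, assuming the same REPL session as above; `hoodie.metadata.enable` is the standard Hudi config for toggling the MDT:

```python
# Diagnostic sketch: identical quickstart write, but with the Hudi metadata
# table disabled, since the stack trace fails inside MDT initialization.
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.partitionpath.field': 'city',
    'hoodie.metadata.enable': 'false',  # skip metadata table initialization
}

inserts.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(basePath)
```

If this write succeeds, the problem is isolated to MDT initialization on this EMR 6.15 setup rather than the bundle or the Spark configs. Removing the partially initialized table first (`rm -rf /tmp/trips_table`) avoids retrying against leftover state from the failed attempt.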
