alberttwong commented on issue #12963:
URL: https://github.com/apache/hudi/issues/12963#issuecomment-2719180052
```
[ec2-user@ip-10-0-10-186 ~]$ ssh [email protected]
The authenticity of host 'ip-10-0-111-168.us-west-2.compute.internal (10.0.111.168)' can't be established.
ED25519 key fingerprint is SHA256:KSKj/vzbIeI5Je8YdxSLKfdSWnuxRuSvXfyh0G/Q6hc.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'ip-10-0-111-168.us-west-2.compute.internal' (ED25519) to the list of known hosts.

[hadoop@ip-10-0-111-168 ~]$ ls /tmp
d33409ee-a9f6-455a-9d8c-ce9190ccab8c_resources  hadoop14465626265819506123.tmp  hadoop2990841558949191269.tmp
hadoop-hdfs-namenode.pid  hadoop-unjar416709301599708946  hadoop-unjar775367731795166908  hadoop-yarn-yarn
hive  hsperfdata_hadoop  hsperfdata_hdfs  hsperfdata_hive  hsperfdata_kms  hsperfdata_livy  hsperfdata_mapred
hsperfdata_root  hsperfdata_spark  hsperfdata_tomcat  hsperfdata_trino  hsperfdata_yarn  hsperfdata_zookeeper
jetty-0_0_0_0-10002-hive-service-3_1_3-amzn-8_jar-_-any-4478184200613831529
jetty-10_0_111_168-8088-hadoop-yarn-common-3_3_6-amzn-1_jar-_-any-9187720009424936355
jetty-ip-10-0-111-168_us-west-2_compute_internal-19888-hadoop-yarn-common-3_3_6-amzn-1_jar-_-any-4187551277844480937
jetty-ip-10-0-111-168_us-west-2_compute_internal-20888-hadoop-yarn-common-3_3_6-amzn-1_jar-_-any-6642537296133348853
jetty-ip-10-0-111-168_us-west-2_compute_internal-8188-hadoop-yarn-common-3_3_6-amzn-1_jar-_-any-542379550446565138
jetty-ip-10-0-111-168_us-west-2_compute_internal-9870-hdfs-_-any-3613376253096943183
libleveldbjni-64-1-4334357513568310058.8  motd.FDkzr  motd.partpBPTq  puppet_bigtop_file_merge
puppet_inter_thread_comm6040914042171461435
systemd-private-f33283165519454fb0af11097df7cb3b-chronyd.service-VnbyxS
systemd-private-f33283165519454fb0af11097df7cb3b-httpd.service-N2bZxI
systemd-private-f33283165519454fb0af11097df7cb3b-mariadb.service-7bYeRo
systemd-private-f33283165519454fb0af11097df7cb3b-nginx.service-5pcRUa
zstd5629203585542328589.tmp  zstd8507920086534566345.tmp

[hadoop@ip-10-0-111-168 ~]$ export PYSPARK_PYTHON=$(which python3)
[hadoop@ip-10-0-111-168 ~]$ export SPARK_VERSION=3.4
[hadoop@ip-10-0-111-168 ~]$ pyspark --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:1.0.1 \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
Python 3.7.16 (default, Feb  8 2025, 00:19:05)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-17)] on linux
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
org.apache.hudi#hudi-spark3.4-bundle_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-49fa9c8e-fa5b-42d6-a6ce-158854b6158f;1.0
	confs: [default]
	found org.apache.hudi#hudi-spark3.4-bundle_2.12;1.0.1 in central
	found org.apache.hive#hive-storage-api;2.8.1 in central
	found org.slf4j#slf4j-api;1.7.36 in central
downloading https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/1.0.1/hudi-spark3.4-bundle_2.12-1.0.1.jar ...
	[SUCCESSFUL ] org.apache.hudi#hudi-spark3.4-bundle_2.12;1.0.1!hudi-spark3.4-bundle_2.12.jar (1106ms)
downloading https://repo1.maven.org/maven2/org/apache/hive/hive-storage-api/2.8.1/hive-storage-api-2.8.1.jar ...
	[SUCCESSFUL ] org.apache.hive#hive-storage-api;2.8.1!hive-storage-api.jar (25ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.36/slf4j-api-1.7.36.jar ...
	[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.36!slf4j-api.jar (31ms)
:: resolution report :: resolve 1145ms :: artifacts dl 1167ms
	:: modules in use:
	org.apache.hive#hive-storage-api;2.8.1 from central in [default]
	org.apache.hudi#hudi-spark3.4-bundle_2.12;1.0.1 from central in [default]
	org.slf4j#slf4j-api;1.7.36 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-49fa9c8e-fa5b-42d6-a6ce-158854b6158f
	confs: [default]
	3 artifacts copied, 0 already retrieved (108331kB/66ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/12 21:31:31 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
25/03/12 21:31:33 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
25/03/12 21:31:39 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hudi_hudi-spark3.4-bundle_2.12-1.0.1.jar added multiple times to distributed cache.
25/03/12 21:31:39 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hive_hive-storage-api-2.8.1.jar added multiple times to distributed cache.
25/03/12 21:31:39 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.36.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1-amzn-2
      /_/

Using Python version 3.7.16 (default, Feb  8 2025 00:19:05)
Spark context Web UI available at http://ip-10-0-111-168.us-west-2.compute.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1741814916834_0001).
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import lit, col
>>>
>>> tableName = "trips_table"
>>> basePath = "file:///tmp/trips_table"
>>> columns = ["ts","uuid","rider","driver","fare","city"]
>>> data =[(1695159649087,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
...        (1695091554788,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70,"san_francisco"),
...        (1695046462179,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90,"san_francisco"),
...        (1695516137016,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"),
...        (1695115999911,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai")]
>>> inserts = spark.createDataFrame(data).toDF(*columns)
>>>
>>> hudi_options = {
...     'hoodie.table.name': tableName,
...     'hoodie.datasource.write.partitionpath.field': 'city'
... }
>>>
>>> inserts.write.format("hudi"). \
...     options(**hudi_options). \
...     mode("overwrite"). \
...     save(basePath)
25/03/12 21:32:09 WARN HoodieSparkSqlWriterInternal: Choosing BULK_INSERT as the operation type since auto record key generation is applicable
25/03/12 21:32:09 INFO HiveConf: Found configuration file file:/etc/spark/conf.dist/hive-site.xml
25/03/12 21:32:09 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
25/03/12 21:32:09 INFO metastore: Trying to connect to metastore with URI thrift://ip-10-0-111-168.us-west-2.compute.internal:9083
25/03/12 21:32:09 INFO metastore: Opened a connection to metastore, current connections: 1
25/03/12 21:32:09 INFO metastore: Connected to metastore.
25/03/12 21:32:13 WARN HoodieTableFileSystemView: Partition: files is not available in store
25/03/12 21:32:13 WARN HoodieTableFileSystemView: Partition: files is not available in store
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 1398, in save
    self._jwrite.save(path)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in __call__
  File "/usr/lib/spark/python/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o124.save.
: org.apache.hudi.exception.HoodieException: Failed to instantiate Metadata table
	at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:309)
	at org.apache.hudi.client.SparkRDDWriteClient.initMetadataTable(SparkRDDWriteClient.java:271)
	at org.apache.hudi.client.BaseHoodieWriteClient.lambda$doInitTable$7(BaseHoodieWriteClient.java:1305)
	at org.apache.hudi.client.BaseHoodieWriteClient.executeUsingTxnManager(BaseHoodieWriteClient.java:1312)
	at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1302)
	at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1352)
	at org.apache.hudi.commit.BaseDatasetBulkInsertCommitActionExecutor.execute(BaseDatasetBulkInsertCommitActionExecutor.java:100)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.bulkInsertAsRow(HoodieSparkSqlWriter.scala:832)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:494)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$write$1(HoodieSparkSqlWriter.scala:192)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
	at org.apache.spark.sql.adapter.BaseSpark3Adapter.sqlExecutionWithNewExecutionId(BaseSpark3Adapter.scala:105)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:214)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:129)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:170)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT partition files should be > 0
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1442)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.commitInternal(HoodieBackedTableMetadataWriter.java:1349)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.bulkCommit(SparkHoodieBackedTableMetadataWriter.java:149)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:489)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:280)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.<init>(HoodieBackedTableMetadataWriter.java:189)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.<init>(SparkHoodieBackedTableMetadataWriter.java:114)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:91)
	at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:303)
	... 71 more
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
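The write fails while initializing the metadata table (MDT): `Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT partition files should be > 0`. One diagnostic step worth trying is to take MDT initialization out of the picture entirely. The sketch below uses `hoodie.metadata.enable` (a real Hudi write config); whether disabling it actually sidesteps this particular failure on EMR is an assumption, not a confirmed fix, and it trades away metadata-table benefits such as file-listing acceleration:

```python
# Sketch of a possible workaround for the quickstart write above:
# disable the Hudi metadata table so the failing MDT initialization is skipped.
# ASSUMPTION: this only bypasses the symptom; it is not a confirmed fix.
tableName = "trips_table"
basePath = "file:///tmp/trips_table"

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.partitionpath.field': 'city',
    # The stack trace dies in HoodieBackedTableMetadataWriter; turn MDT off.
    'hoodie.metadata.enable': 'false',
}

# The write itself needs a live SparkSession with the Hudi bundle on the
# classpath, so it is shown for context only:
# inserts.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
```

If the write succeeds with the metadata table disabled, that would narrow the bug to MDT bootstrap in Hudi 1.0.1 rather than the data path itself.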
