alberttwong opened a new issue, #12963:
URL: https://github.com/apache/hudi/issues/12963
**Describe the problem you faced**

Following the quickstart at https://hudi.apache.org/docs/quick-start-guide/ on EMR 6.15, the very first write to the table fails with `HoodieException: Failed to instantiate Metadata table`.

**To Reproduce**

Steps to reproduce the behavior:

```
[hadoop@ip-10-0-125-28 ~]$ export SPARK_VERSION=3.4
[hadoop@ip-10-0-125-28 ~]$ rm hudi-spark3.4-bundle_2.12-1.0.1.jar
[hadoop@ip-10-0-125-28 ~]$ ls
[hadoop@ip-10-0-125-28 ~]$ export PYSPARK_PYTHON=$(which python3)
[hadoop@ip-10-0-125-28 ~]$ pyspark --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:1.0.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
Python 3.7.16 (default, Feb  8 2025, 00:19:05)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-17)] on linux
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
org.apache.hudi#hudi-spark3.4-bundle_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-daca4bcd-853b-4f9b-8233-340110eac111;1.0
	confs: [default]
	found org.apache.hudi#hudi-spark3.4-bundle_2.12;1.0.1 in central
	found org.apache.hive#hive-storage-api;2.8.1 in central
	found org.slf4j#slf4j-api;1.7.36 in central
downloading https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/1.0.1/hudi-spark3.4-bundle_2.12-1.0.1.jar ...
	[SUCCESSFUL ] org.apache.hudi#hudi-spark3.4-bundle_2.12;1.0.1!hudi-spark3.4-bundle_2.12.jar (1294ms)
downloading https://repo1.maven.org/maven2/org/apache/hive/hive-storage-api/2.8.1/hive-storage-api-2.8.1.jar ...
	[SUCCESSFUL ] org.apache.hive#hive-storage-api;2.8.1!hive-storage-api.jar (29ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.36/slf4j-api-1.7.36.jar ...
	[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.36!slf4j-api.jar (21ms)
:: resolution report :: resolve 1080ms :: artifacts dl 1347ms
	:: modules in use:
	org.apache.hive#hive-storage-api;2.8.1 from central in [default]
	org.apache.hudi#hudi-spark3.4-bundle_2.12;1.0.1 from central in [default]
	org.slf4j#slf4j-api;1.7.36 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-daca4bcd-853b-4f9b-8233-340110eac111
	confs: [default]
	3 artifacts copied, 0 already retrieved (108331kB/65ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/12 18:34:17 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
25/03/12 18:34:19 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
25/03/12 18:34:26 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hudi_hudi-spark3.4-bundle_2.12-1.0.1.jar added multiple times to distributed cache.
25/03/12 18:34:26 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.apache.hive_hive-storage-api-2.8.1.jar added multiple times to distributed cache.
25/03/12 18:34:26 WARN Client: Same path resource file:///home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.36.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1-amzn-2
      /_/

Using Python version 3.7.16 (default, Feb  8 2025 00:19:05)
Spark context Web UI available at http://ip-10-0-125-28.us-west-2.compute.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1741804316774_0001).
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import lit, col
>>>
>>> tableName = "trips_table"
>>> basePath = "file:///tmp/trips_table"
>>> columns = ["ts","uuid","rider","driver","fare","city"]
>>> data =[(1695159649087,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
...     (1695091554788,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
...     (1695046462179,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
...     (1695516137016,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"),
...     (1695115999911,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai")]
>>> inserts = spark.createDataFrame(data).toDF(*columns)
>>>
>>> hudi_options = {
...     'hoodie.table.name': tableName,
...     'hoodie.datasource.write.partitionpath.field': 'city'
... }
>>>
>>> inserts.write.format("hudi"). \
...     options(**hudi_options). \
...     mode("overwrite"). \
...     save(basePath)
25/03/12 18:35:00 WARN HoodieSparkSqlWriterInternal: Choosing BULK_INSERT as the operation type since auto record key generation is applicable
25/03/12 18:35:00 INFO HiveConf: Found configuration file file:/etc/spark/conf.dist/hive-site.xml
25/03/12 18:35:00 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
25/03/12 18:35:00 INFO metastore: Trying to connect to metastore with URI thrift://ip-10-0-125-28.us-west-2.compute.internal:9083
25/03/12 18:35:00 INFO metastore: Opened a connection to metastore, current connections: 1
25/03/12 18:35:00 INFO metastore: Connected to metastore.
25/03/12 18:35:04 WARN HoodieTableFileSystemView: Partition: files is not available in store
25/03/12 18:35:04 WARN HoodieTableFileSystemView: Partition: files is not available in store
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 1398, in save
    self._jwrite.save(path)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in __call__
  File "/usr/lib/spark/python/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o124.save.
: org.apache.hudi.exception.HoodieException: Failed to instantiate Metadata table
	at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:309)
	at org.apache.hudi.client.SparkRDDWriteClient.initMetadataTable(SparkRDDWriteClient.java:271)
	at org.apache.hudi.client.BaseHoodieWriteClient.lambda$doInitTable$7(BaseHoodieWriteClient.java:1305)
	at org.apache.hudi.client.BaseHoodieWriteClient.executeUsingTxnManager(BaseHoodieWriteClient.java:1312)
	at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1302)
	at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1352)
	at org.apache.hudi.commit.BaseDatasetBulkInsertCommitActionExecutor.execute(BaseDatasetBulkInsertCommitActionExecutor.java:100)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.bulkInsertAsRow(HoodieSparkSqlWriter.scala:832)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:494)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$write$1(HoodieSparkSqlWriter.scala:192)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
	at org.apache.spark.sql.adapter.BaseSpark3Adapter.sqlExecutionWithNewExecutionId(BaseSpark3Adapter.scala:105)
	at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:214)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:129)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:170)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: FileGroup count for MDT partition files should be > 0
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.prepRecords(HoodieBackedTableMetadataWriter.java:1442)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.commitInternal(HoodieBackedTableMetadataWriter.java:1349)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.bulkCommit(SparkHoodieBackedTableMetadataWriter.java:149)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:489)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:280)
	at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.<init>(HoodieBackedTableMetadataWriter.java:189)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.<init>(SparkHoodieBackedTableMetadataWriter.java:114)
	at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:91)
	at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:303)
	... 71 more
```

**Expected behavior**

The quickstart bulk insert completes and the Hudi table is written to `file:///tmp/trips_table`.

**Environment Description**

* Hudi version : 1.0.1
* Spark version : 3.4
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) : no
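**Additional context**

The `Caused by` shows the write dying while the writer initializes the metadata table (MDT): `FileGroup count for MDT partition files should be > 0`, raised inside `initializeMetadataTable()` before any data is written. As a diagnostic step (not a fix), the same write can be retried with the metadata table disabled to check whether the rest of the quickstart path works on this cluster. A minimal sketch, assuming the same REPL session as above; `hoodie.metadata.enable` is the standard Hudi config for toggling the MDT:

```python
# Diagnostic sketch: identical quickstart write, but with the Hudi metadata
# table disabled, since the stack trace fails inside MDT initialization.
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.partitionpath.field': 'city',
    'hoodie.metadata.enable': 'false',  # skip metadata table initialization
}

inserts.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(basePath)
```

If this write succeeds, the problem is isolated to MDT initialization on this EMR 6.15 setup rather than the bundle or the Spark configs. Removing the partially initialized table first (`rm -rf /tmp/trips_table`) avoids retrying against leftover state from the failed attempt.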
