mtami opened a new issue #4008:
URL: https://github.com/apache/hudi/issues/4008
**Describe the problem you faced**

I am trying to bulk_insert a small table (~150 MB) into S3 using Apache Hudi. I want to partition the data by date (`yyyy/MM/dd`) using hive_style_partitioning. The table (with its partition subfolders) is created successfully on S3, but the Hive sync then fails with the following:
```
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table brand_tree
...
Caused by: java.lang.IllegalArgumentException: Partition key parts [created] does not match with partition values [2019, 01, 17]. Check partition strategy.
```
Here is my bulk_insert configuration:
```
{
    'hoodie.bulkinsert.shuffle.parallelism': 3,
    'hoodie.datasource.write.operation': 'bulk_insert',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
    'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd',
    'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-MM-dd HH:mm:ss',
    'hoodie.datasource.write.partitionpath.field': f'{partition_key}:TIMESTAMP',
    'hoodie.datasource.hive_sync.partition_fields': f'{partition_key}:TIMESTAMP',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex': '',
    'hoodie.deltastreamer.keygen.timebased.input.timezone': 'GMT',
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': pre_combine_key,
    'hoodie.datasource.write.recordkey.field': ','.join(record_keys),
    'hoodie.table.name': table_name,
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': db_name,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.index.type': 'GLOBAL_SIMPLE'
}
```
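For context, this dict is passed straight to the Spark DataFrame writer. A minimal sketch of the write call (illustrative, not my exact job code; `data_frame`, `combined_conf`, and `target_path` are placeholders for values built earlier in the job, with `partition_key = 'created'`):
```
# Minimal sketch of how the configuration above is applied.
# combined_conf is the dict shown above.
data_frame.write \
    .format('org.apache.hudi') \
    .options(**combined_conf) \
    .mode('overwrite') \
    .save(target_path)  # e.g. s3://schema_name/table_name/
```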
And here is the full Python traceback:
```
2021-11-16 11:26:57,813 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
  File "/tmp/dev-poc", line 357, in <module>
    main()
  File "/tmp/dev-poc", line 353, in main
    rds_job_driver.run()
  File "/tmp/dev-poc", line 228, in run
    self.transform()
  File "/tmp/dev-poc", line 330, in transform
    partition_key=partition_key)
  File "/tmp/dev-poc", line 176, in overwrite
    self.write_df(data_frame, write_mode, target_path, self.data_format, **combined_conf)
  File "/tmp/dev-poc", line 91, in write_df
    save(target)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 734, in save
    self._jwrite.save(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o164.save.
: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing brand_tree_partition
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:132)
    at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:425)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:479)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:475)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:475)
    at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:548)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:238)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:170)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table brand_tree_partition
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:332)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:188)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:118)
    ... 40 more
Caused by: java.lang.IllegalArgumentException: Partition key parts [created] does not match with partition values [2019, 01, 17]. Check partition strategy.
    at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
    at org.apache.hudi.hive.HoodieHiveClient.getPartitionClause(HoodieHiveClient.java:223)
    at org.apache.hudi.hive.HoodieHiveClient.constructAddPartitions(HoodieHiveClient.java:199)
    at org.apache.hudi.hive.HoodieHiveClient.addPartitionsToTable(HoodieHiveClient.java:143)
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:327)
    ... 42 more
```
A few notes:

1. Sample partition path: `s3://schema_name/table_name/created=2021/11/10/` -> this folder contains the output parquet files (see the sketch below).
2. Partitioning uses hive style.
3. I am converting the partition field (`created`) to a string before ingesting/saving to the Hudi table on S3.
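To make the mismatch concrete: with `output.dateformat` set to `yyyy/MM/dd`, each record lands in a relative partition path with three slash-separated segments, while only one partition field (`created`) is declared for Hive sync. A small self-contained illustration of the failing check (my own sketch, not Hudi's actual extractor code):
```
# Illustration of the mismatch from the stack trace; this mimics how a
# multi-part extractor would split the relative partition path. It is
# NOT the actual Hudi implementation.
partition_fields = ['created']        # hoodie.datasource.hive_sync.partition_fields
relative_path = 'created=2019/01/17'  # shape produced by output.dateformat 'yyyy/MM/dd'

# Each '/' segment becomes one partition value; 'key=' prefixes are stripped.
values = [segment.split('=')[-1] for segment in relative_path.split('/')]
print(values)            # ['2019', '01', '17'] -> three values
print(partition_fields)  # ['created']          -> one key part

# The equivalent of the check that raises "Partition key parts ... does not
# match with partition values ... Check partition strategy."
assert len(partition_fields) == len(values), 'Check partition strategy.'
```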
**Environment Description**

* Hudi version : 0.9.0
* Spark version : 2.4 (PySpark)
* Hadoop version : 2.7.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no