mtami opened a new issue #4008:
URL: https://github.com/apache/hudi/issues/4008
**Describe the problem you faced**

I am trying to bulk_insert a small table (~150 MB) into S3 using Apache Hudi. I want to partition the data by date (`yyyy/MM/dd`) using hive_style_partitioning. The table (with its partition subfolders) is created successfully on S3, but the Hive sync then fails with the following:
```
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table brand_tree
...
Caused by: java.lang.IllegalArgumentException: Partition key parts [created] does not match with partition values [2019, 01, 17]. Check partition strategy.
```
Here is my bulk_insert configuration:
```
{
    'hoodie.bulkinsert.shuffle.parallelism': 3,
    'hoodie.datasource.write.operation': 'bulk_insert',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
    'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd',
    'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-MM-dd HH:mm:ss',
    'hoodie.datasource.write.partitionpath.field': f'{partition_key}:TIMESTAMP',
    'hoodie.datasource.hive_sync.partition_fields': f'{partition_key}:TIMESTAMP',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex': '',
    'hoodie.deltastreamer.keygen.timebased.input.timezone': 'GMT',
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': pre_combine_key,
    'hoodie.datasource.write.recordkey.field': ','.join(record_keys),
    'hoodie.table.name': table_name,
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': db_name,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.index.type': 'GLOBAL_SIMPLE'
}
```
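For context, this dict is passed straight to the Spark DataFrame writer. A minimal sketch of the write call (illustrative, not my exact job code; `data_frame`, `combined_conf`, and `target_path` are placeholders for values built earlier in the job, with `partition_key = 'created'`):
```
# Minimal sketch of how the configuration above is applied.
# combined_conf is the dict shown above.
data_frame.write \
    .format('org.apache.hudi') \
    .options(**combined_conf) \
    .mode('overwrite') \
    .save(target_path)  # e.g. s3://schema_name/table_name/
```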
And here is the full Python traceback:
```
2021-11-16 11:26:57,813 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
  File "/tmp/dev-poc", line 357, in <module>
    main()
  File "/tmp/dev-poc", line 353, in main
    rds_job_driver.run()
  File "/tmp/dev-poc", line 228, in run
    self.transform()
  File "/tmp/dev-poc", line 330, in transform
    partition_key=partition_key)
  File "/tmp/dev-poc", line 176, in overwrite
    self.write_df(data_frame, write_mode, target_path, self.data_format, **combined_conf)
  File "/tmp/dev-poc", line 91, in write_df
    save(target)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 734, in save
    self._jwrite.save(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o164.save.
: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing brand_tree_partition
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:132)
    at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:425)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:479)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:475)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:475)
    at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:548)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:238)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:170)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table brand_tree_partition
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:332)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:188)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:118)
    ... 40 more
Caused by: java.lang.IllegalArgumentException: Partition key parts [created] does not match with partition values [2019, 01, 17]. Check partition strategy.
    at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
    at org.apache.hudi.hive.HoodieHiveClient.getPartitionClause(HoodieHiveClient.java:223)
    at org.apache.hudi.hive.HoodieHiveClient.constructAddPartitions(HoodieHiveClient.java:199)
    at org.apache.hudi.hive.HoodieHiveClient.addPartitionsToTable(HoodieHiveClient.java:143)
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:327)
    ... 42 more
```
A few notes:

1. Sample partition path: `s3://schema_name/table_name/created=2021/11/10/` -> this folder contains the output parquet files (see the sketch below).
2. Partitioning uses hive style.
3. I am converting the partition field (`created`) to a string before ingesting/saving to the Hudi table on S3.
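To make the mismatch concrete: with `output.dateformat` set to `yyyy/MM/dd`, each record lands in a relative partition path with three slash-separated segments, while only one partition field (`created`) is declared for Hive sync. A small self-contained illustration of the failing check (my own sketch, not Hudi's actual extractor code):
```
# Illustration of the mismatch from the stack trace; this mimics how a
# multi-part extractor would split the relative partition path. It is
# NOT the actual Hudi implementation.
partition_fields = ['created']        # hoodie.datasource.hive_sync.partition_fields
relative_path = 'created=2019/01/17'  # shape produced by output.dateformat 'yyyy/MM/dd'

# Each '/' segment becomes one partition value; 'key=' prefixes are stripped.
values = [segment.split('=')[-1] for segment in relative_path.split('/')]
print(values)            # ['2019', '01', '17'] -> three values
print(partition_fields)  # ['created']          -> one key part

# The equivalent of the check that raises "Partition key parts ... does not
# match with partition values ... Check partition strategy."
assert len(partition_fields) == len(values), 'Check partition strategy.'
```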
**Environment Description**

* Hudi version : 0.9.0
* Spark version : 2.4 (PySpark)
* Hadoop version : 2.7.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no