imuntyan opened a new issue, #7820:
URL: https://github.com/apache/hudi/issues/7820
Enabling the Hudi cleaner service (sync or async) causes an error when
upserting a record with Spark's append save mode.
**To Reproduce**
I am running the following script in a Jupyter notebook:
```python
from numpy import random

hudi_mode, spark_mode = 'upsert', 'append'
insert_hudi_path = "s3://[redacted]/hudi/test01"

data = [
    {
        "id": f'id-{random.randint(1000000000)}',
        "text": f'text-{random.randint(1000000000)}',
    }
]
df = spark.createDataFrame(data)

table_name = 'table_test01'
primary_key = "id"
precombine = "text"

hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.operation': hudi_mode,
    'hoodie.datasource.write.recordkey.field': primary_key,
    'hoodie.datasource.write.precombine.field': precombine,
    'hoodie.metadata.enable': True,
    'hoodie.clean.automatic': True,
    # 'hoodie.clean.async': True,
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.datasource.write.keygenerator.class':
        'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
}

df.write.format('hudi') \
    .options(**hudi_options) \
    .mode(spark_mode) \
    .save(insert_hudi_path)
```
When the S3 location is empty, the script runs without errors and creates the
data in S3. Running it a second time throws the errors attached below. This
happens with both the sync and the async cleaner, although in async mode the
errors first appear on the third run.
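To pin down which run fails first in each mode, the write can be wrapped in a small counting loop. This is only a sketch: `first_failing_run` and `write_once` are hypothetical names, and `write_once` is assumed to be a zero-argument function wrapping the `df.write ... .save(insert_hudi_path)` call from the script above.

```python
def first_failing_run(write_once, max_runs=5):
    """Call write_once() up to max_runs times.

    Returns (run_number, exception) for the first failing run (1-based),
    or (None, None) if every run succeeds. Hypothetical helper, not a Hudi API.
    """
    for run in range(1, max_runs + 1):
        try:
            write_once()
        except Exception as exc:  # broad on purpose: Spark surfaces wrapped Java errors
            return run, exc
    return None, None
```

Starting from an empty S3 location, the behaviour described above would correspond to a first failing run of 2 with the sync cleaner and 3 with the async cleaner.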
The errors do not occur when the following configuration lines are commented
out:
```
# 'hoodie.clean.automatic': True,
# 'hoodie.clean.async': True,
# 'hoodie.cleaner.commits.retained': 1,
```
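To switch between the three cases (cleaner off, sync, async) without editing the dictionary by hand, the options could be built by a helper along these lines. This is a sketch under assumptions: `build_hudi_options` and its `cleaner` parameter are hypothetical names, not part of Hudi; the configuration keys and values are the ones from the script above.

```python
def build_hudi_options(table_name, primary_key, precombine, cleaner="off"):
    """Return Hudi write options; `cleaner` is 'off', 'sync', or 'async'.

    Hypothetical helper for this reproduction, not a Hudi API.
    """
    if cleaner not in ("off", "sync", "async"):
        raise ValueError(f"unknown cleaner mode: {cleaner!r}")

    options = {
        'hoodie.table.name': table_name,
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.recordkey.field': primary_key,
        'hoodie.datasource.write.precombine.field': precombine,
        'hoodie.metadata.enable': True,
        'hoodie.datasource.write.keygenerator.class':
            'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    }
    if cleaner != "off":
        # The cleaner settings whose presence triggers the reported errors.
        options['hoodie.clean.automatic'] = True
        options['hoodie.cleaner.commits.retained'] = 1
    if cleaner == "async":
        options['hoodie.clean.async'] = True
    return options
```

The result can be passed to the writer as before, for example `.options(**build_hudi_options(table_name, primary_key, precombine, cleaner="sync"))`.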
**Expected behavior**
No errors.
**Environment Description**
* Hudi version : 0.12.1-amzn-0-SNAPSHOT
* Spark version : 3.3.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* AWS EMR version: emr-6.9.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : EMR on EC2
**Stacktrace**
[emr-errors.txt](https://github.com/apache/hudi/files/10560680/emr-errors.txt)