imuntyan opened a new issue, #7820:
URL: https://github.com/apache/hudi/issues/7820

   Enabling the Hudi cleaner service (sync or async) throws an error when upserting a record with the Spark `append` save mode.
   
   **To Reproduce**
   
   I am running the following script in a Jupyter notebook:
   ```python
   from numpy import random
   
   hudi_mode, spark_mode = 'upsert', 'append'
   insert_hudi_path = "s3://[redacted]/hudi/test01"
   
   data = [
       {"id": f'id-{random.randint(1000000000)}', "text": f'text-{random.randint(1000000000)}'}
       ]
            
   df = spark.createDataFrame(data)
   
   table_name = 'table_test01'
   primary_key="id"
   precombine="text"
   hudi_options = {
           'hoodie.table.name': table_name,
           'hoodie.datasource.write.operation': hudi_mode,
           'hoodie.datasource.write.recordkey.field': primary_key,
           'hoodie.datasource.write.precombine.field': precombine,
           'hoodie.metadata.enable': True,
           'hoodie.clean.automatic': True,
   #         'hoodie.clean.async': True,
           'hoodie.cleaner.commits.retained': 1,
           'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator'
       }
   
   df.write.format('hudi') \
           .options(**hudi_options) \
           .mode(spark_mode) \
           .save(insert_hudi_path)
   ```
   When the S3 location is empty, this script runs without errors and creates the data in S3. Running it a second time throws the errors attached below. The errors occur in both sync and async cleaner modes; in async mode they first appear on the third run.
   
   The errors do not occur when the following configuration is commented out:
   ```
   #        'hoodie.clean.automatic': True,
   #        'hoodie.clean.async': True,
   #        'hoodie.cleaner.commits.retained': 1,
   
   ```
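
To make the three variants above (cleaner off, sync cleaner, async cleaner) easier to switch between when reproducing, the options could be built by a small helper. This is only a sketch: `build_hudi_options` and its `cleaner` parameter are hypothetical names introduced here, not part of Hudi, and the keys and values are the ones from the script above.

```python
def build_hudi_options(table_name, record_key, precombine, cleaner=None):
    """Build the Hudi write options used in the reproduction script.

    cleaner (hypothetical switch, not a Hudi setting):
      None    -> cleaner keys omitted (the variant that does not fail)
      'sync'  -> hoodie.clean.automatic + commits.retained
      'async' -> same as 'sync' plus hoodie.clean.async
    """
    if cleaner not in (None, 'sync', 'async'):
        raise ValueError(f"unknown cleaner mode: {cleaner!r}")

    options = {
        'hoodie.table.name': table_name,
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.recordkey.field': record_key,
        'hoodie.datasource.write.precombine.field': precombine,
        'hoodie.metadata.enable': True,
        'hoodie.datasource.write.keygenerator.class':
            'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    }

    if cleaner is not None:
        # The cleaner settings that trigger the errors in this report.
        options['hoodie.clean.automatic'] = True
        options['hoodie.cleaner.commits.retained'] = 1
        if cleaner == 'async':
            options['hoodie.clean.async'] = True

    return options
```

With this helper, each run of the script would pass `build_hudi_options('table_test01', 'id', 'text', cleaner='sync')` (or `'async'`, or `None`) to `.options(**...)` in place of the hand-edited dictionary.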
   
   **Expected behavior**
   
   No errors.
   
   **Environment Description**
   
   * Hudi version : 0.12.1-amzn-0-SNAPSHOT
   
   * Spark version : 3.3.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * AWS EMR version: emr-6.9.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : EMR on EC2
   
   
   **Stacktrace**
   
   
[emr-errors.txt](https://github.com/apache/hudi/files/10560680/emr-errors.txt)
   
   

