[GitHub] [hudi] imuntyan opened a new issue, #7820: [SUPPORT] Errors are thrown when upserting a record with cleaner service enabled

via GitHub Wed, 01 Feb 2023 11:00:36 -0800


imuntyan opened a new issue, #7820:
URL: https://github.com/apache/hudi/issues/7820


   With the Hudi cleaner service enabled (sync or async), upserting a record in 
Spark append mode throws errors.
   
   **To Reproduce**
   
   I am running the following script in a Jupyter notebook:
   ```python
   from numpy import random
   
   # Hudi write operation and Spark save mode used for every run
   hudi_mode, spark_mode = 'upsert', 'append'
   insert_hudi_path = "s3://[redacted]/hudi/test01"
   
   # One record with a random key and a random payload
   data = [
       {"id": f'id-{random.randint(1000000000)}',
        "text": f'text-{random.randint(1000000000)}'}
   ]
   
   # `spark` is the SparkSession already available in the notebook
   df = spark.createDataFrame(data)
   
   table_name = 'table_test01'
   primary_key = "id"
   precombine = "text"
   hudi_options = {
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.operation': hudi_mode,
       'hoodie.datasource.write.recordkey.field': primary_key,
       'hoodie.datasource.write.precombine.field': precombine,
       'hoodie.metadata.enable': True,
       # Cleaner settings; the errors go away when these are commented out
       'hoodie.clean.automatic': True,
       # 'hoodie.clean.async': True,
       'hoodie.cleaner.commits.retained': 1,
       'hoodie.datasource.write.keygenerator.class':
           'org.apache.hudi.keygen.NonpartitionedKeyGenerator'
   }
   
   df.write.format('hudi') \
       .options(**hudi_options) \
       .mode(spark_mode) \
       .save(insert_hudi_path)
   ```
   When the S3 location is empty, the script runs without errors and creates 
the data in S3. Running it a second time throws the errors attached below. The 
errors occur in both sync and async cleaner modes, although in async mode they 
first appear on the third run.
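
To pin down which run fails first in each cleaner mode, the repeated runs can be wrapped in a small loop. This is only a minimal sketch: `first_failing_run` and `write_once` are hypothetical names, and `write_once` is assumed to be a zero-argument function that builds one new record and performs the `df.write ... .save(insert_hudi_path)` call from the script above.

```python
def first_failing_run(write_once, max_runs=3):
    """Call write_once() up to max_runs times.

    Returns (run_number, exception) for the first run that raises,
    with run_number starting at 1, or None if every run succeeds.
    """
    for run in range(1, max_runs + 1):
        try:
            write_once()
        except Exception as exc:  # keep the error so it can be inspected
            return run, exc
    return None
```

In the notebook this would be called as `first_failing_run(write_once)` against an empty S3 location, once per cleaner mode, so that the reported run number can be compared between the two modes.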
   
   The errors do not occur when the following cleaner configuration is 
commented out:
   ```
   #        'hoodie.clean.automatic': True,
   #        'hoodie.clean.async': True,
   #        'hoodie.cleaner.commits.retained': 1,
   
   ```
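
Instead of commenting lines in and out by hand, the three variants (cleaner keys omitted, sync, async) can be built from one base dictionary. This is a minimal sketch: the helper name `build_hudi_options` and its `cleaner` parameter are hypothetical, while all keys and values are copied from the script above.

```python
# Options that stay the same in every variant (copied from the script above)
BASE_OPTIONS = {
    'hoodie.table.name': 'table_test01',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'text',
    'hoodie.metadata.enable': True,
    'hoodie.datasource.write.keygenerator.class':
        'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
}

# Cleaner keys that trigger the errors when present
CLEANER_OPTIONS = {
    'hoodie.clean.automatic': True,
    'hoodie.cleaner.commits.retained': 1,
}


def build_hudi_options(cleaner=None):
    """Return the write options for one variant.

    cleaner: None (cleaner keys omitted), 'sync', or 'async'.
    """
    if cleaner not in (None, 'sync', 'async'):
        raise ValueError(f'unknown cleaner mode: {cleaner!r}')
    options = dict(BASE_OPTIONS)  # copy so the base dict is never mutated
    if cleaner is not None:
        options.update(CLEANER_OPTIONS)
    if cleaner == 'async':
        options['hoodie.clean.async'] = True
    return options
```

The result can be passed to the writer in place of `hudi_options`, for example `.options(**build_hudi_options('sync'))`.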
   
   **Expected behavior**
   
   No errors.
   
   **Environment Description**
   
   * Hudi version : 0.12.1-amzn-0-SNAPSHOT
   
   * Spark version : 3.3.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * AWS EMR version: emr-6.9.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : EMR on EC2
   
   
   **Stacktrace**
   
   
[emr-errors.txt](https://github.com/apache/hudi/files/10560680/emr-errors.txt)
   
   


