ketkidev opened a new issue, #9674:
URL: https://github.com/apache/hudi/issues/9674

   **Describe the problem you faced**
   While running two concurrent operations on a Hudi MOR table, we are facing data loss.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   1. Run an upsert on the Hudi MOR table at 10:30 AM.
   2. Run the cleaner utility on the same MOR table at 10:31 AM, while the upsert is still in progress.
   
   **Note**: the Hoodie Cleaner runs as a separate process (the spark-submit command below).
   Hudi MOR Table configuration: 
   ```
   {
       'hoodie.table.name': asset,
       'hoodie.datasource.write.recordkey.field': id,
       'hoodie.datasource.write.table.name': asset,
       'hoodie.upsert.shuffle.parallelism': 400,
       'hoodie.keep.max.commits': 50,
       'hoodie.keep.min.commits': 49,
       'hoodie.compact.inline.max.delta.commits': 6,
       'hoodie.clean.automatic': 'false',
       'hoodie.clean.async': 'false'
   }
   ```
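   For reference, a minimal PySpark sketch of how the upsert in step 1 is submitted. The dataframe, record key column, and S3 path are placeholders, and the `hoodie.datasource.write.table.type` / `hoodie.datasource.write.operation` keys are assumptions added for completeness (they are not part of the configuration above):
   ```
   # Sketch of the step-1 upsert; placeholder names, assumed table type/operation.
   hudi_options = {
       'hoodie.table.name': 'asset',
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.table.name': 'asset',
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',  # assumed: MOR table
       'hoodie.datasource.write.operation': 'upsert',          # assumed: upsert write
       'hoodie.upsert.shuffle.parallelism': 400,
       'hoodie.keep.max.commits': 50,
       'hoodie.keep.min.commits': 49,
       'hoodie.compact.inline.max.delta.commits': 6,
       'hoodie.clean.automatic': 'false',
       'hoodie.clean.async': 'false',
   }

   (incoming_df.write
       .format('hudi')
       .options(**hudi_options)
       .mode('append')
       .save('s3a://bucket_name/table_path/reference/'))
   ```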
   Cleaner Utility:
   ```
   /usr/local/bin/spark-submit \
     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
     --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
     --conf spark.hadoop.fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
     --conf spark.jars.packages=org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hadoop:hadoop-aws:3.2.2,com.amazonaws:aws-java-sdk-bundle:1.12.180,org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0 \
     --class org.apache.hudi.utilities.HoodieCleaner /home/ubuntu/hudi-utilities-bundle.jar \
     --target-base-path s3a://bucket_name/table_path/reference/ \
     --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
     --hoodie-conf hoodie.keep.max.commits=50 \
     --hoodie-conf hoodie.keep.min.commits=49 \
     --hoodie-conf hoodie.cleaner.commits.retained=48 \
     --hoodie-conf hoodie.cleaner.parallelism=400
   ```
   **Expected behavior**
   
   No data should be lost: every record written by the upsert should remain available in the Hudi MOR table after the cleaner run.
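   
   For illustration, a snapshot read that can be used to check the record count after both jobs finish (the base path and the record key column are placeholders):
   ```
   # Snapshot read of the MOR table after the concurrent upsert + clean.
   # The base path and the 'id' column are placeholders.
   snapshot_df = spark.read.format('hudi').load('s3a://bucket_name/table_path/reference/')
   print(snapshot_df.count())                          # total rows visible
   print(snapshot_df.select('id').distinct().count())  # distinct record keys
   ```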
   
   **Environment Description**
   
   * Hudi version : 0.13
   
   * Spark version : 3.3.2 
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   
   
   

