gtwuser opened a new issue, #5880:
URL: https://github.com/apache/hudi/issues/5880
**Describe the problem you faced**
After upsert records are duplicated across commits in s3. Since S3 is the
source for downstream we need to remove the old commits which are in form of
parquet files and retain only the latest commits.
**Configs**
It's using the default `COW` table type.
```python
commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    # 'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.write.precombine.field': 'ModTime',
    'hoodie.datasource.write.recordkey.field': 'Moid',
    'hoodie.table.name': 'intersight',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': args['database_name'],
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.hive_sync.table': 'intersight' + prefix.replace("/", "_").lower(),
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': 's3://' + args['curated_bucket'] + '/intersight' + prefix,
    'hoodie.parquet.small.file.limit': '134217728'  # 128 * 1024 * 1024 = 134,217,728 bytes (128 MB)
}

incrementalConfig = {
    'hoodie.upsert.shuffle.parallelism': 68,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 10
}
```
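Note that with `hoodie.cleaner.policy: KEEP_LATEST_COMMITS` and `hoodie.cleaner.commits.retained: 10`, the Hudi cleaner intentionally keeps the file versions needed to serve the last 10 commits, so superseded parquet files stay on S3 until enough newer commits have accumulated. A minimal sketch of a more aggressive cleaner configuration (the exact values here are illustrative assumptions, not settings taken from this job):

```python
# Illustrative only: clean eagerly so superseded file versions are removed sooner.
# These values are assumptions for the sketch, not the job's actual settings.
aggressiveCleanConfig = {
    'hoodie.clean.automatic': 'true',              # run the cleaner after each commit
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 1,          # keep only files needed by the latest commit
}
```

Retaining very few commits trades away incremental-query and rollback headroom, so this is only appropriate when downstream consumers read the raw files and cannot deduplicate.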
Saved with:
```python
inputDf.write \
    .format('org.apache.hudi') \
    .option("spark.hadoop.parquet.avro.write-old-list-structure", "false") \
    .option("parquet.avro.write-old-list-structure", "false") \
    .option("spark.hadoop.parquet.avro.add-list-element-records", "false") \
    .option("parquet.avro.add-list-element-records", "false") \
    .option("hoodie.parquet.avro.write-old-list-structure", "false") \
    .option("hoodie.datasource.write.reconcile.schema", "true") \
    .options(**combinedConf) \
    .mode('append') \
    .save()
```
After the initial insert, when an upsert is performed, we see that the records inserted earlier are still retained in the S3 bucket as older parquet files. As a result, the bucket now contains multiple duplicate records.
**To Reproduce**
Steps to reproduce the behavior:
1. Insert the following records:
```
{"name":"rohan", "id":1}
{"name":"rakesh", "id":2}
```
2. This insertion created 2 parquet files, with one record per file
`file1.parquet` - `{"name":"rohan", "id":1}`
`file2.parquet` - `{"name":"rakesh", "id":2}`
3. Add a new record and update the existing one, something like this:
   ~~`{"name":"rohan", "id":1}`~~ -> `{"name":"rohan123", "id":1}` # updated record
   `{"name":"ram", "id":3}` # new record
4. This upsert creates another 2 new files, so in total we now have 4 files:
   `file1.parquet` - `{"name":"rohan", "id":1}` # old file, not deleted yet (leads to duplicate records)
   `file2.parquet` - `{"name":"rakesh", "id":2}` # old file, not deleted yet (leads to duplicate records)
   `file3.parquet` - `{"name":"rohan123", "id":1}`, `{"name":"rakesh", "id":2}` # the updated record and the other existing record are merged into this file, which is created on upsert
   `file4.parquet` - `{"name":"ram", "id":3}` # the newly added record goes into this file
5. Our issue: since `file1.parquet`, `file2.parquet`, and `file3.parquet` together contain duplicate records, any consumer reading the raw parquet files receives these `duplicate records`, which is invalid.
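The duplication exists only at the raw-file level; a Hudi snapshot query resolves each record key to its latest version using the precombine field. That merge can be sketched in plain Python (the field names `Moid`/`ModTime` follow the configs above, and the rows and `ModTime` values are hypothetical, chosen to mirror the four files in the steps):

```python
def latest_per_key(records, key_field="Moid", precombine_field="ModTime"):
    """Keep only the newest version of each record key, mimicking what a Hudi
    snapshot query returns after an upsert. Illustrative sketch, not Hudi code."""
    latest = {}
    for rec in records:
        key = rec[key_field]
        if key not in latest or rec[precombine_field] > latest[key][precombine_field]:
            latest[key] = rec
    return list(latest.values())

# Reading all four files together yields duplicates for keys 1 and 2:
rows = [
    {"Moid": 1, "name": "rohan",    "ModTime": 1},  # file1 (stale)
    {"Moid": 2, "name": "rakesh",   "ModTime": 1},  # file2 (stale copy)
    {"Moid": 1, "name": "rohan123", "ModTime": 2},  # file3 (updated)
    {"Moid": 2, "name": "rakesh",   "ModTime": 2},  # file3 (rewritten)
    {"Moid": 3, "name": "ram",      "ModTime": 2},  # file4 (new)
]
deduped = latest_per_key(rows)  # one row per Moid, newest ModTime wins
```

This is effectively the de-duplication the downstream consumer would have to perform as long as it bypasses Hudi and reads the parquet files directly.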
**Expected behavior**
Ideally, `file1.parquet` and `file2.parquet` should not exist after the upsert, since `file3.parquet` already contains the latest changes for both records.
With every update, the old files should be deleted or marked as deleted, so that consumers of the S3 bucket files do not have to deal with de-duplication.
**Environment Description**
* Hudi version : 0.10.1
* Spark version : 3.1
* Running on Docker? (yes/no) : no
**Additional context**
Running AWS Glue 3.0 job using pyspark