gtwuser opened a new issue, #5880:
URL: https://github.com/apache/hudi/issues/5880
**Describe the problem you faced**
After upsert records are duplicated across commits in s3. Since S3 is the
source for downstream we need to remove the old commits which are in form of
parquet files and retain only the latest commits.
**Configs**
It's using the default `COW` table type.
```python
commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    # 'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.write.precombine.field': 'ModTime',
    'hoodie.datasource.write.recordkey.field': 'Moid',
    'hoodie.table.name': 'intersight',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': args['database_name'],
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.hive_sync.table': 'intersight' + prefix.replace("/", "_").lower(),
    'hoodie.datasource.hive_sync.enable': 'true',
    'path': 's3://' + args['curated_bucket'] + '/intersight' + prefix,
    'hoodie.parquet.small.file.limit': '134217728'  # 128 * 1024 * 1024 = 134,217,728 bytes (128 MB)
}

incrementalConfig = {
    'hoodie.upsert.shuffle.parallelism': 68,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 10
}
```
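Note that with `hoodie.cleaner.policy: KEEP_LATEST_COMMITS` and `hoodie.cleaner.commits.retained: 10`, the Hudi cleaner intentionally keeps the file versions needed to serve the last 10 commits, so superseded parquet files stay on S3 until enough newer commits have accumulated. A minimal sketch of a more aggressive cleaner configuration (the exact values here are illustrative assumptions, not settings taken from this job):

```python
# Illustrative only: clean eagerly so superseded file versions are removed sooner.
# These values are assumptions for the sketch, not the job's actual settings.
aggressiveCleanConfig = {
    'hoodie.clean.automatic': 'true',              # run the cleaner after each commit
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 1,          # keep only files needed by the latest commit
}
```

Retaining very few commits trades away incremental-query and rollback headroom, so this is only appropriate when downstream consumers read the raw files and cannot deduplicate.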
Saved with:
```python
inputDf.write \
    .format('org.apache.hudi') \
    .option("spark.hadoop.parquet.avro.write-old-list-structure", "false") \
    .option("parquet.avro.write-old-list-structure", "false") \
    .option("spark.hadoop.parquet.avro.add-list-element-records", "false") \
    .option("parquet.avro.add-list-element-records", "false") \
    .option("hoodie.parquet.avro.write-old-list-structure", "false") \
    .option("hoodie.datasource.write.reconcile.schema", "true") \
    .options(**combinedConf) \
    .mode('append') \
    .save()
```
After the initial insert, when an upsert is performed, we see that the records inserted earlier are still retained in the S3 bucket as older parquet files. As a result, the bucket now contains multiple duplicate records.
**To Reproduce**
Steps to reproduce the behavior:
1. Insert the following records:
```
{"name":"rohan", "id":1}
{"name":"rakesh", "id":2}
```
2. This insertion created 2 parquet files, with one record per file
`file1.parquet` - `{"name":"rohan", "id":1}`
`file2.parquet` - `{"name":"rakesh", "id":2}`
3. Add a new record and update the existing one, something like this:
   ~~`{"name":"rohan", "id":1}`~~ -> `{"name":"rohan123", "id":1}` # updated record
   `{"name":"ram", "id":3}` # new record
4. This upsert creates another 2 new files, so in total we now have 4 files:
   `file1.parquet` - `{"name":"rohan", "id":1}` # old file, not deleted yet (leads to duplicate records)
   `file2.parquet` - `{"name":"rakesh", "id":2}` # old file, not deleted yet (leads to duplicate records)
   `file3.parquet` - `{"name":"rohan123", "id":1}`, `{"name":"rakesh", "id":2}` # the updated record and the other existing record are merged into this file, which is created on upsert
   `file4.parquet` - `{"name":"ram", "id":3}` # the newly added record goes into this file
5. Our issue: since `file1.parquet`, `file2.parquet`, and `file3.parquet` together contain duplicate records, any consumer reading the raw parquet files receives these `duplicate records`, which is invalid.
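The duplication exists only at the raw-file level; a Hudi snapshot query resolves each record key to its latest version using the precombine field. That merge can be sketched in plain Python (the field names `Moid`/`ModTime` follow the configs above, and the rows and `ModTime` values are hypothetical, chosen to mirror the four files in the steps):

```python
def latest_per_key(records, key_field="Moid", precombine_field="ModTime"):
    """Keep only the newest version of each record key, mimicking what a Hudi
    snapshot query returns after an upsert. Illustrative sketch, not Hudi code."""
    latest = {}
    for rec in records:
        key = rec[key_field]
        if key not in latest or rec[precombine_field] > latest[key][precombine_field]:
            latest[key] = rec
    return list(latest.values())

# Reading all four files together yields duplicates for keys 1 and 2:
rows = [
    {"Moid": 1, "name": "rohan",    "ModTime": 1},  # file1 (stale)
    {"Moid": 2, "name": "rakesh",   "ModTime": 1},  # file2 (stale copy)
    {"Moid": 1, "name": "rohan123", "ModTime": 2},  # file3 (updated)
    {"Moid": 2, "name": "rakesh",   "ModTime": 2},  # file3 (rewritten)
    {"Moid": 3, "name": "ram",      "ModTime": 2},  # file4 (new)
]
deduped = latest_per_key(rows)  # one row per Moid, newest ModTime wins
```

This is effectively the de-duplication the downstream consumer would have to perform as long as it bypasses Hudi and reads the parquet files directly.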
**Expected behavior**
Ideally, `file1.parquet` and `file2.parquet` should not exist after the upsert, since `file3.parquet` already contains the latest changes for both records.
With every update, the old files should be deleted or marked as deleted, so that consumers of the S3 bucket files do not have to deal with de-duplication.
**Environment Description**
* Hudi version : 0.10.1
* Spark version : 3.1
* Running on Docker? (yes/no) : no
**Additional context**
Running AWS Glue 3.0 job using pyspark