Raghvendradubey opened a new issue, #9536:
URL: https://github.com/apache/hudi/issues/9536
Hi Team,
I am facing an issue of duplicate record keys when upserting data into Hudi
on EMR.
Hudi Jar -
hudi-spark3.1.2-bundle_2.12-0.10.1.jar
EMR Version -
emr-6.5.0
Workflow -
files on S3 -> EMR(hudi) -> Hudi Tables(S3)
Schedule - once in a day
Insert Data Size -
5 to 10 MB per batch
Hudi Configuration for Upsert -
hudi_options = {
    'hoodie.table.name': "txn_table",
    'hoodie.datasource.write.recordkey.field': "transaction_id",
    'hoodie.datasource.write.partitionpath.field': 'billing_date',
    'hoodie.datasource.write.table.name': "txn_table",
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'transaction_id',
    'hoodie.index.type': "GLOBAL_BLOOM",
    'hoodie.bloom.index.update.partition.path': "true",
    'hoodie.upsert.shuffle.parallelism': 10,
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.datasource.hive_sync.database': "dwh",
    'hoodie.datasource.hive_sync.table': "txn_table",
    'hoodie.datasource.hive_sync.partition_fields': "billing_date",
    'hoodie.datasource.write.hive_style_partitioning': "true",
    'hoodie.datasource.hive_sync.enable': "true",
    'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': "true",
    'hoodie.datasource.hive_sync.support_timestamp': "true",
    'hoodie.metadata.enable': "true"
}
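One detail worth noting in the configuration above: the precombine field is set to the same column as the record key (transaction_id). Hudi uses the precombine field to decide which of two versions of the same record key should win, so a field that is identical across versions cannot break ties deterministically. The sketch below is purely illustrative (it is not Hudi's actual implementation, and the updated_at column is a hypothetical example, not a field from the original schema); it only shows why a timestamp-like precombine field behaves differently from a key-valued one.

```python
# Illustrative sketch (NOT Hudi code): how a precombine field resolves two
# versions of the same record key. With 'transaction_id' as both record key
# and precombine field, both versions carry the SAME precombine value, so
# the comparison cannot tell old from new -- a monotonically increasing
# column such as an update timestamp is the usual choice instead.

def precombine(existing, incoming, precombine_field):
    """Keep the record with the larger precombine value; on a tie,
    the winner is effectively arbitrary (here: the incoming one)."""
    if incoming[precombine_field] >= existing[precombine_field]:
        return incoming
    return existing

# 'updated_at' is a hypothetical example column, not part of the schema above.
old = {"transaction_id": "t1", "amount": 100, "updated_at": 1}
new = {"transaction_id": "t1", "amount": 120, "updated_at": 2}

# With a monotonically increasing precombine field, the newer record
# always wins, regardless of arrival order:
assert precombine(old, new, "updated_at") == new
assert precombine(new, old, "updated_at") == new

# With transaction_id as the precombine field, the values tie ("t1" == "t1"),
# so which record survives depends only on arrival order:
assert precombine(old, new, "transaction_id") == new
assert precombine(new, old, "transaction_id") == old
```

This does not by itself create duplicate file entries, but it is one configuration aspect worth double-checking alongside the index behavior.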
Issue Occurrence -
Our job has been running in production for about a month, and this issue
has now appeared for the first time.
Even when I tried to reproduce the issue with the same dataset, it was not
reproducible; the records updated successfully.
Issue Steps -
1 - There is a batch of data which we first insert into txn_table;
transaction_id (defined as the record key) is unique throughout the
partition.
2 - The next day, on update of a record key, a new row was created with the
same record key in the same partition, carrying the updated value.
3 - Both duplicate rows could be read, but when I try to update, only the
latest row is updated.
4 - On checking the parquet files, the duplicate record with the updated
value was present in a different file within the same partition.
Steps to Reproduce -
The issue is not reproducible; even when the same dataset was ingested again
with the same configuration, the upsert worked fine.
Please let me know if I am missing some configuration.
Thanks
Raghvendra
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]