Raghvendradubey opened a new issue, #9536:
URL: https://github.com/apache/hudi/issues/9536

   Hi Team,
   
   I am facing an issue with duplicate record keys when upserting data into Hudi 
on EMR.
   
   Hudi Jar - 
   hudi-spark3.1.2-bundle_2.12-0.10.1.jar
   
   EMR Version - 
   emr-6.5.0
   
   Workflow - 
   files on S3 -> EMR(hudi) -> Hudi Tables(S3)
   
   Schedule - once in a day
   
   Insert Data Size - 
   5 to 10 MB per batch
   
   Hudi Configuration for Upsert - 
   
   hudi_options = {
       'hoodie.table.name': "txn_table",
       'hoodie.datasource.write.recordkey.field': "transaction_id",
       'hoodie.datasource.write.partitionpath.field': 'billing_date',
       'hoodie.datasource.write.table.name': "txn_table",
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': 'transaction_id',
       'hoodie.index.type': "GLOBAL_BLOOM",
       'hoodie.bloom.index.update.partition.path': "true",
       'hoodie.upsert.shuffle.parallelism': 10,
       'hoodie.insert.shuffle.parallelism': 10,
       'hoodie.datasource.hive_sync.database': "dwh",
       'hoodie.datasource.hive_sync.table': "txn_table",
       'hoodie.datasource.hive_sync.partition_fields': "billing_date",
       'hoodie.datasource.write.hive_style_partitioning': "true",
       'hoodie.datasource.hive_sync.enable': "true",
       'hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled': "true",
       'hoodie.datasource.hive_sync.support_timestamp': "true",
       'hoodie.metadata.enable': "true"
   }
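   For anyone triaging, the guarantee these settings are meant to provide can be
sketched as a tiny pure-Python model. This is an illustration of the intended
semantics, not Hudi code; the field names transaction_id and billing_date are
taken from the configuration above:

   ```python
   # Toy model of the upsert semantics that GLOBAL_BLOOM together with
   # 'hoodie.bloom.index.update.partition.path' = "true" should guarantee:
   # exactly one row per record key across the whole table, with an update
   # that changes billing_date moving the row to the new partition rather
   # than leaving a duplicate behind.

   table = {}  # transaction_id -> (billing_date, payload)

   def upsert(transaction_id, billing_date, payload):
       # A global index resolves the key across every partition, so an
       # existing key is replaced in place even if its partition changed.
       table[transaction_id] = (billing_date, payload)

   upsert("txn-1", "2023-08-01", {"amount": 100})
   upsert("txn-1", "2023-08-02", {"amount": 150})  # next-day update
   ```

   In the issue described below, the second upsert instead produced a second
physical row in another parquet file of the same partition.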
   
   Issue Occurrence - 
   Our job has been running in production for around a month, and this is the 
first time the issue has appeared.
   Even when I tried to reproduce it with the same dataset, it did not 
reproduce; the records updated successfully.
   
   Issue Steps  - 
   
   1 - First, we insert a batch of data into txn_table; transaction_id (defined 
as the record key) is unique throughout the partition.
   2 - The next day, on updating that record key, a new row is created with the 
same record key in the same partition, containing the updated value.
   3 - Both duplicate rows can be read, but when I try another update, only the 
latest row is updated.
   4 - On checking the parquet files, the duplicate record with the updated 
value was present in a different file in the same partition.
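   To quantify how many physical copies of a key exist, a quick check like the
following can be run over rows read back from the table (a generic sketch, not
a Hudi API; transaction_id is the record key from the configuration above):

   ```python
   from collections import Counter

   def find_duplicate_keys(rows, key_field="transaction_id"):
       """Return {key: occurrence_count} for record keys seen more than once.

       'rows' is any iterable of dicts, e.g. the table's rows collected
       back into Python after a snapshot read.
       """
       counts = Counter(row[key_field] for row in rows)
       return {key: n for key, n in counts.items() if n > 1}

   # Example mirroring the report: the same transaction_id appears twice
   # in one partition, once with the old value and once with the new one.
   rows = [
       {"transaction_id": "txn-1", "billing_date": "2023-08-01", "amount": 100},
       {"transaction_id": "txn-1", "billing_date": "2023-08-01", "amount": 150},
       {"transaction_id": "txn-2", "billing_date": "2023-08-01", "amount": 200},
   ]
   duplicates = find_duplicate_keys(rows)
   ```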
   
   Steps to Reproduce - 
   
   The issue is not reproducible; when the same dataset was ingested again with 
the same configuration, the upsert worked fine.
   
   Please let me know if I am missing some configuration.
   
   Thanks
   Raghvendra


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
