[I] [SUPPORT] Duplicates while partition is being updated [hudi]

via GitHub Wed, 16 Apr 2025 03:35:10 -0700


Hfal91 opened a new issue, #13153:
URL: https://github.com/apache/hudi/issues/13153


   **Describe the problem you faced**
   
   If the table is queried while a write job that updates the partition field 
is running, there is a brief window in which the table returns duplicates.
   
   It seems to me that this happens at the moment when the new version of the 
record has been created in the new partition but the old version has not yet 
been removed from the old partition.
   
   When the job finishes, the table does not return duplicates.
   
   Is there a way to solve this in this version of Hudi (v0.14.1), or has it 
already been fixed in newer versions?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Have a large table partitioned by a specific field
   2. Run a job that updates the partition field of existing records
   3. Query the table (in my case using Athena); you may need to query several 
times before you hit the moment in which it returns duplicates
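
One way to check for the transient duplicates in step 3 is to group by the record key and look for keys that appear more than once. The helper below is only a sketch: the table name `my_db.my_table` and the key column `record_id` are hypothetical placeholders, not names from the actual table.

```python
def duplicate_check_sql(table: str, key_column: str) -> str:
    """Build a query listing record keys that appear more than once."""
    return (
        f"SELECT {key_column}, COUNT(*) AS copies "
        f"FROM {table} "
        f"GROUP BY {key_column} "
        f"HAVING COUNT(*) > 1"
    )


# Hypothetical names, for illustration only.
print(duplicate_check_sql("my_db.my_table", "record_id"))
```

Running the generated query repeatedly while the write job is in progress should return rows only during the window described above, and none once the job has finished.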
   
   Relevant options used:
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.index.type': 'RECORD_INDEX',
       'hoodie.record.index.update.partition.path': 'true',
       'hoodie.compact.inline.max.delta.commits': '1'
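
For context, the options above would typically be passed to a Spark DataFrame write roughly as sketched below. This is an assumption about how the job is wired, not the actual job code: the function name `write_update`, the `table_options` argument, and the example values in the usage comment are hypothetical. A real job would also need table-specific settings such as `hoodie.table.name` and the record key and partition path fields.

```python
# Options quoted from the report.
HUDI_OPTIONS = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.record.index.update.partition.path": "true",
    "hoodie.compact.inline.max.delta.commits": "1",
}


def write_update(df, table_path, table_options):
    """Upsert a Spark DataFrame into a Hudi table (hypothetical helper).

    `table_options` carries the table-specific settings that the report
    does not list (table name, record key, partition path field, ...).
    """
    options = {**HUDI_OPTIONS, **table_options}
    df.write.format("hudi").options(**options).mode("append").save(table_path)


# Usage sketch with hypothetical values:
# write_update(
#     updates_df,
#     "s3://my-bucket/my_table",
#     {
#         "hoodie.table.name": "my_table",
#         "hoodie.datasource.write.recordkey.field": "record_id",
#         "hoodie.datasource.write.partitionpath.field": "region",
#     },
# )
```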
   
   **Expected behavior**
   
   Queries should never return duplicates, even while a write job is running.
   
   **Environment Description**
   
   * Hudi version : v0.14.1
   
   * Spark version : 3.5.1
   
   * Hive version : 3.1.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   
   
   


