tped17 opened a new issue, #11989: URL: https://github.com/apache/hudi/issues/11989
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? - this link gives me a 404
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

We noticed an issue with two of our datasets wherein we have multiple rows with the same `_hoodie_record_key`, `_hoodie_commit_time`, and `_hoodie_commit_seqno` within the same file. Unfortunately, all of the problematic commits have been archived. Below is an example of the duplicate records (I've redacted the exact record key, but they are all the same); each sequence number is repeated 64 times.

```
+------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
|_hoodie_record_key|_hoodie_commit_time|_hoodie_file_name                                                              |_hoodie_commit_seqno        |count|
+------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360995|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360996|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360993|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360994|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360994|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360995|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360996|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360993|64   |
+------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
```
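For reference, a query along the following lines can be used to surface this kind of duplication (a minimal sketch; the table path and app name are placeholders, not taken from our actual job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("hudi-duplicate-check").getOrCreate()

// Snapshot read of the Hudi table (the path below is a placeholder).
val df = spark.read.format("hudi").load("s3://my-bucket/path/to/table")

// Group on the Hudi metadata columns; any count > 1 means the same
// record key / commit time / commit seqno appears more than once in a file.
val dupes = df
  .groupBy("_hoodie_record_key", "_hoodie_commit_time", "_hoodie_file_name", "_hoodie_commit_seqno")
  .count()
  .filter(col("count") > 1)

dupes.show(false)
```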
Here's the config we use:

```
hoodie.parquet.small.file.limit -> 104857600
hoodie.datasource.write.precombine.field -> eventVersion
hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.EmptyHoodieRecordPayload
hoodie.bloom.index.filter.dynamic.max.entries -> 1106137
hoodie.cleaner.fileversions.retained -> 2
hoodie.parquet.max.file.size -> 134217728
hoodie.cleaner.parallelism -> 1500
hoodie.write.lock.client.num_retries -> 10
hoodie.delete.shuffle.parallelism -> 1500
hoodie.bloom.index.prune.by.ranges -> true
hoodie.metadata.enable -> false
hoodie.clean.automatic -> false
hoodie.datasource.write.operation -> upsert
hoodie.write.lock.wait_time_ms -> 600000
hoodie.metrics.reporter.type -> CLOUDWATCH
hoodie.datasource.write.recordkey.field -> timestamp,eventId,subType,trackedItem
hoodie.table.name -> my_table_name
hoodie.datasource.write.table.type -> COPY_ON_WRITE
hoodie.datasource.write.hive_style_partitioning -> true
hoodie.datasource.write.partitions.to.delete -> 
hoodie.write.lock.dynamodb.partition_key -> my_table_name_key
hoodie.cleaner.policy -> KEEP_LATEST_FILE_VERSIONS
hoodie.write.markers.type -> DIRECT
hoodie.metrics.on -> false
hoodie.datasource.write.reconcile.schema -> true
hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.cleaner.policy.failed.writes -> LAZY
hoodie.upsert.shuffle.parallelism -> 1500
hoodie.write.lock.dynamodb.table -> HoodieLockTable
hoodie.write.lock.provider -> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.datasource.write.partitionpath.field -> region,year,month,day,hour
hoodie.bloom.index.filter.type -> DYNAMIC_V0
hoodie.write.lock.wait_time_ms_between_retry -> 30000
hoodie.write.concurrency.mode -> optimistic_concurrency_control
hoodie.write.lock.dynamodb.region -> us-east-1
```

**To Reproduce**

We have not been able to reproduce this intentionally. It only happens occasionally in our dataset and does not seem to follow any pattern that we've been able to discern.

**Expected behavior**

It is my understanding that we should not be seeing a large number of duplicate rows per sequence number.

**Environment Description**

* Hudi version : 0.11.1
* Spark version : 3.2.1
* Hive version : 3.1.3
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

For the datasets in which we found the issue we run cleaning and clustering manually, and I noticed that our lock keys were incorrectly configured on the cleaning/clustering jobs, so it is possible that we were running cleaning or clustering at the same time as data ingestion or deletion (see the lock-configuration sketch at the end of this issue).

Please let me know if you need any more info, thank you!
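For reference, here is a rough sketch of what keeping the lock options identical across every job that touches the table (ingestion, deletion, cleaning, clustering) would look like. The option values are copied from the writer config above; the `upsert` helper, `DataFrame`, and base path are illustrative placeholders, not our actual job code:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Shared lock configuration: every job writing to the table should use the
// same provider, lock table, and partition key, otherwise ingestion and the
// manual clean/cluster jobs do not actually contend on the same lock.
// Values below are taken from the writer config above.
val lockOptions: Map[String, String] = Map(
  "hoodie.write.concurrency.mode"                -> "optimistic_concurrency_control",
  "hoodie.cleaner.policy.failed.writes"          -> "LAZY",
  "hoodie.write.lock.provider"                   -> "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
  "hoodie.write.lock.dynamodb.table"             -> "HoodieLockTable",
  "hoodie.write.lock.dynamodb.partition_key"     -> "my_table_name_key", // must match on the clean/cluster jobs too
  "hoodie.write.lock.dynamodb.region"            -> "us-east-1",
  "hoodie.write.lock.wait_time_ms"               -> "600000",
  "hoodie.write.lock.wait_time_ms_between_retry" -> "30000",
  "hoodie.write.lock.client.num_retries"         -> "10"
)

// Illustrative upsert helper (placeholder): the same lockOptions map is
// applied on every write path to the table.
def upsert(df: DataFrame, basePath: String): Unit =
  df.write.format("hudi")
    .options(lockOptions)
    .option("hoodie.table.name", "my_table_name")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    .save(basePath)
```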
