tped17 opened a new issue, #11989: URL: https://github.com/apache/hudi/issues/11989
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? - this link gives me a 404
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

We noticed an issue with two of our datasets wherein we have multiple rows with the same `_hoodie_record_key`, `_hoodie_commit_time`, and `_hoodie_commit_seqno` within the same file. Unfortunately, all of the problematic commits have been archived. Below is an example of the duplicate records (I've redacted the exact record key, but they are all the same); each sequence number is repeated 64 times.

```
+------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
|_hoodie_record_key|_hoodie_commit_time|_hoodie_file_name                                                              |_hoodie_commit_seqno        |count|
+------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360995|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360996|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360993|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360994|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360994|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360995|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360996|64   |
|XXXX              |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360993|64   |
+------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
```
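For reference, a query along the following lines can be used to surface this kind of duplication (a minimal sketch; the table path and app name are placeholders, not taken from our actual job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("hudi-duplicate-check").getOrCreate()

// Snapshot read of the Hudi table (the path below is a placeholder).
val df = spark.read.format("hudi").load("s3://my-bucket/path/to/table")

// Group on the Hudi metadata columns; any count > 1 means the same
// record key / commit time / commit seqno appears more than once in a file.
val dupes = df
  .groupBy("_hoodie_record_key", "_hoodie_commit_time", "_hoodie_file_name", "_hoodie_commit_seqno")
  .count()
  .filter(col("count") > 1)

dupes.show(false)
```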
Here's the config we use:

```
hoodie.parquet.small.file.limit -> 104857600
hoodie.datasource.write.precombine.field -> eventVersion
hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.EmptyHoodieRecordPayload
hoodie.bloom.index.filter.dynamic.max.entries -> 1106137
hoodie.cleaner.fileversions.retained -> 2
hoodie.parquet.max.file.size -> 134217728
hoodie.cleaner.parallelism -> 1500
hoodie.write.lock.client.num_retries -> 10
hoodie.delete.shuffle.parallelism -> 1500
hoodie.bloom.index.prune.by.ranges -> true
hoodie.metadata.enable -> false
hoodie.clean.automatic -> false
hoodie.datasource.write.operation -> upsert
hoodie.write.lock.wait_time_ms -> 600000
hoodie.metrics.reporter.type -> CLOUDWATCH
hoodie.datasource.write.recordkey.field -> timestamp,eventId,subType,trackedItem
hoodie.table.name -> my_table_name
hoodie.datasource.write.table.type -> COPY_ON_WRITE
hoodie.datasource.write.hive_style_partitioning -> true
hoodie.datasource.write.partitions.to.delete -> 
hoodie.write.lock.dynamodb.partition_key -> my_table_name_key
hoodie.cleaner.policy -> KEEP_LATEST_FILE_VERSIONS
hoodie.write.markers.type -> DIRECT
hoodie.metrics.on -> false
hoodie.datasource.write.reconcile.schema -> true
hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.cleaner.policy.failed.writes -> LAZY
hoodie.upsert.shuffle.parallelism -> 1500
hoodie.write.lock.dynamodb.table -> HoodieLockTable
hoodie.write.lock.provider -> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.datasource.write.partitionpath.field -> region,year,month,day,hour
hoodie.bloom.index.filter.type -> DYNAMIC_V0
hoodie.write.lock.wait_time_ms_between_retry -> 30000
hoodie.write.concurrency.mode -> optimistic_concurrency_control
hoodie.write.lock.dynamodb.region -> us-east-1
```

**To Reproduce**

We have not been able to reproduce this intentionally. It only happens occasionally in our dataset and does not seem to follow any pattern that we've been able to discern.

**Expected behavior**

It is my understanding that we should not be seeing a large number of duplicate rows per sequence number.

**Environment Description**

* Hudi version : 0.11.1
* Spark version : 3.2.1
* Hive version : 3.1.3
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

For the datasets in which we found the issue we run cleaning and clustering manually, and I noticed that our lock keys were incorrectly configured on the cleaning/clustering jobs, so it is possible that we were running cleaning or clustering at the same time as data ingestion or deletion (see the lock-configuration sketch at the end of this issue).

Please let me know if you need any more info, thank you!
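For reference, here is a rough sketch of what keeping the lock options identical across every job that touches the table (ingestion, deletion, cleaning, clustering) would look like. The option values are copied from the writer config above; the `upsert` helper, `DataFrame`, and base path are illustrative placeholders, not our actual job code:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Shared lock configuration: every job writing to the table should use the
// same provider, lock table, and partition key, otherwise ingestion and the
// manual clean/cluster jobs do not actually contend on the same lock.
// Values below are taken from the writer config above.
val lockOptions: Map[String, String] = Map(
  "hoodie.write.concurrency.mode"                -> "optimistic_concurrency_control",
  "hoodie.cleaner.policy.failed.writes"          -> "LAZY",
  "hoodie.write.lock.provider"                   -> "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
  "hoodie.write.lock.dynamodb.table"             -> "HoodieLockTable",
  "hoodie.write.lock.dynamodb.partition_key"     -> "my_table_name_key", // must match on the clean/cluster jobs too
  "hoodie.write.lock.dynamodb.region"            -> "us-east-1",
  "hoodie.write.lock.wait_time_ms"               -> "600000",
  "hoodie.write.lock.wait_time_ms_between_retry" -> "30000",
  "hoodie.write.lock.client.num_retries"         -> "10"
)

// Illustrative upsert helper (placeholder): the same lockOptions map is
// applied on every write path to the table.
def upsert(df: DataFrame, basePath: String): Unit =
  df.write.format("hudi")
    .options(lockOptions)
    .option("hoodie.table.name", "my_table_name")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    .save(basePath)
```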
