coffee34 opened a new issue, #8474: URL: https://github.com/apache/hudi/issues/8474
**Describe the problem you faced**

When upserting existing records into the same partition, an executor loss during the "Building workload profile" stage causes the stage to re-run, and a record may then be misclassified as an insert instead of an update. The result is two records with the same record key in the same partition. This only occurs when a Spark executor is lost and the stage is re-run; otherwise there are no duplicates.

<img width="1894" alt="Screen Shot 2023-04-17 at 11 25 34" src="https://user-images.githubusercontent.com/64056509/232363759-65c40f3b-443e-40fe-9c28-eba848616e85.png">

My hypothesis is that Hudi hits an issue when re-running the tagLocation phase and fails to find the corresponding base file for the record. This has happened several times in our production environment after executor loss, but I have been unable to reproduce it in our staging environment. I have since tuned the Spark configuration to prevent executor loss, so the issue is not currently occurring. I would greatly appreciate any insight into why this might happen, and any possible solutions or workarounds.

**Environment Description**

* Hudi version : 0.11.1
* Spark version : 3.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker?
(yes/no) : no

**Additional context**

Here is the config used when writing to Hudi:

```
--write-type upsert
--load.hudi.record-key id
--load.options {"hoodie.upsert.shuffle.parallelism":200}

DataSourceWriteOptions.ASYNC_COMPACT_ENABLE -> false,
DataSourceWriteOptions.HIVE_STYLE_PARTITIONING -> true,
DataSourceWriteOptions.TABLE_TYPE -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
FileSystemViewStorageConfig.INCREMENTAL_TIMELINE_SYNC_ENABLE -> false,
HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED -> 6,
HoodieCompactionConfig.CLEANER_POLICY -> HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS,
HoodieCompactionConfig.INLINE_COMPACT -> true,
HoodiePayloadConfig.EVENT_TIME_FIELD -> Columns.InternalTimestamp,
HoodiePayloadConfig.ORDERING_FIELD -> Columns.InternalTimestamp,
HoodieIndexConfig.INDEX_TYPE -> HoodieIndex.IndexType.BLOOM,
HoodieWriteConfig.MARKERS_TYPE -> MarkerType.DIRECT,
HoodieWriteConfig.ROLLBACK_USING_MARKERS_ENABLE -> false,
HoodieWriteCommitCallbackConfig.TURN_CALLBACK_ON -> true
```
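While investigating, affected tables can be audited for this symptom by grouping on the record key within each partition. Below is a minimal sketch of that check in plain Python; in practice the same check would run as a Spark SQL query over the table's `_hoodie_record_key` and `_hoodie_partition_path` meta fields (the function name and tuple shape here are illustrative assumptions, not Hudi APIs):

```python
from collections import Counter

def find_duplicate_keys(records):
    """Return record keys that appear more than once within a single partition.

    `records` is a list of (record_key, partition_path) tuples, e.g. the
    projection of _hoodie_record_key and _hoodie_partition_path from a
    snapshot read of the table. Keys duplicated across *different*
    partitions are not reported, matching the symptom described above.
    """
    counts = Counter(records)
    return sorted(key for (key, _partition), n in counts.items() if n > 1)

# Example: "id-1" was written twice to the same partition after the
# misclassified insert.
rows = [
    ("id-1", "dt=2023-04-17"),
    ("id-1", "dt=2023-04-17"),
    ("id-2", "dt=2023-04-17"),
]
# find_duplicate_keys(rows) -> ["id-1"]
```

The Spark SQL equivalent would be a `GROUP BY _hoodie_record_key, _hoodie_partition_path HAVING count(*) > 1` over the snapshot view of the table.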
