coffee34 opened a new issue, #8474:
URL: https://github.com/apache/hudi/issues/8474

   **Describe the problem you faced**
   
   I have observed that when upserting existing records into the same partition, an executor loss during the "Building workload profile" stage (which causes the stage to re-run) can lead to a record being misclassified as an insert instead of an update. As a result, two records with the same record key end up in the same partition. This only occurs when a Spark executor is lost and the stage is re-run; otherwise there are no duplicates.
   <img width="1894" alt="Screen Shot 2023-04-17 at 11 25 34" src="https://user-images.githubusercontent.com/64056509/232363759-65c40f3b-443e-40fe-9c28-eba848616e85.png">
   
   My hypothesis is that Hudi hits an issue when re-running the tagLocation phase, causing it to fail to find the corresponding base file for the record.
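
   To make the hypothesis concrete, here is a toy plain-Python sketch (not Hudi's actual code, and all names are illustrative) of how a failed index lookup on the re-run could turn an update into an insert:

   ```python
   # Toy model of the hypothesis: tagLocation maps a record key to the file
   # group holding its current version. If the re-run lookup misses the base
   # file, the record is classified as an insert and a duplicate key appears.

   def classify(record_key, index):
       """Mimic tagLocation: a known key is an update to its file group;
       a miss is treated as a brand-new insert."""
       file_id = index.get(record_key)
       return ("update", file_id) if file_id else ("insert", None)

   healthy_index = {"id-1": "file-group-A"}      # key is found -> update
   assert classify("id-1", healthy_index) == ("update", "file-group-A")

   # After executor loss and stage retry, suppose the lookup fails to find
   # the base file for the same key:
   broken_index = {}                             # simulated failed lookup
   op, _ = classify("id-1", broken_index)
   assert op == "insert"   # misclassified -> second record with the same key
   ```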
   
   This has happened several times in our production environment after executor 
loss, but I have been unable to reproduce it in our staging environment. I have 
modified the Spark configuration to prevent executor loss, so the issue is not 
currently occurring.
   
   I would greatly appreciate it if you could provide some insight into why 
this might happen and any possible solutions or workarounds to address this 
issue.
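
   For reference, the duplicates above can be detected by counting record keys per partition. A plain-Python sketch over exported `(partition, record_key)` pairs (partition and key values below are illustrative, not from our table); the same check can be run as a `GROUP BY ... HAVING COUNT(*) > 1` query over the table's record key column:

   ```python
   from collections import Counter

   # Hypothetical exported rows: one (partition, record_key) pair per record.
   rows = [
       ("dt=2023-04-17", "id-1"),
       ("dt=2023-04-17", "id-1"),   # duplicate produced after the stage re-run
       ("dt=2023-04-17", "id-2"),
   ]

   # Any pair seen more than once is a duplicate record key in a partition.
   counts = Counter(rows)
   duplicates = {pair: n for pair, n in counts.items() if n > 1}
   print(duplicates)   # {('dt=2023-04-17', 'id-1'): 2}
   ```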
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Here is the config used when writing to Hudi:
   ```
   --write-type upsert
   --load.hudi.record-key       id
   --load.options       {"hoodie.upsert.shuffle.parallelism":200}

   DataSourceWriteOptions.ASYNC_COMPACT_ENABLE -> false,
   DataSourceWriteOptions.HIVE_STYLE_PARTITIONING -> true,
   DataSourceWriteOptions.TABLE_TYPE -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
   FileSystemViewStorageConfig.INCREMENTAL_TIMELINE_SYNC_ENABLE -> false,
   HoodieCompactionConfig.CLEANER_FILE_VERSIONS_RETAINED -> 6,
   HoodieCompactionConfig.CLEANER_POLICY -> HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS,
   HoodieCompactionConfig.INLINE_COMPACT -> true,
   HoodiePayloadConfig.EVENT_TIME_FIELD -> Columns.InternalTimestamp,
   HoodiePayloadConfig.ORDERING_FIELD -> Columns.InternalTimestamp,
   HoodieIndexConfig.INDEX_TYPE -> HoodieIndex.IndexType.BLOOM,
   HoodieWriteConfig.MARKERS_TYPE -> MarkerType.DIRECT,
   HoodieWriteConfig.ROLLBACK_USING_MARKERS_ENABLE -> false,
   HoodieWriteCommitCallbackConfig.TURN_CALLBACK_ON -> true
   ```

