heshshark opened a new issue, #18105: URL: https://github.com/apache/hudi/issues/18105
### Bug Description

**What happened:**

When writing a Hudi table with Flink, the Parquet files generated by a previous successful commit are no longer present (root cause not yet identified). As a result, the next commit merges against an older Parquet file from an earlier commit instead of the immediately previous version, and the data from the missing commit is lost. In summary:

- A commit completes successfully.
- The Parquet files for this commit are expected to exist.
- Before the next commit, these Parquet files are missing.
- The next commit merges based on an earlier Parquet version.

**What you expected:**

For each commit written by Flink:

- Parquet files produced by the commit should exist until they are explicitly replaced or cleaned.
- The next commit should merge based on the immediately previous commit's Parquet files.
- No earlier Parquet version should be used for the merge unless explicitly intended.

**Steps to reproduce:**

Exact steps to reproduce are unclear. Incident timeline:

- 16:58 ~ 17:02: The job entered a restarting state.
- 17:08: Determined that the job could not recover quickly and latency was increasing. Increased write task parallelism from 256 to 384 and restarted the job from checkpoint.
- 17:19: Observed the JobManager running out of memory with multiple full GCs. Increased JobManager memory from 4 GB to 8 GB and restarted the job again.
- 17:30: Observed the job stuck in the write phase while consuming very little data. Suspected this was caused by the parallelism change, so reduced parallelism from 384 back to 256 and restarted the job from checkpoint again.
- 17:44: TaskManager OOM occurred.
- 17:45: Increased TaskManager memory from 8 GB to 12 GB, adjusted parallelism from 256 to 384, and restarted the job without a checkpoint, consuming from group-offsets.
- 18:04:02: A new instant 20260203180402381 was started.
- 18:04:03: A rollback of 20260203180402381 occurred, but a version with 7 parquet files for instant 20260203180402381 was still generated.
- 18:04:04: A new instant 20260203180404672 was started.
- 18:10:54: Flink triggered a checkpoint.
- 18:36:55: The checkpoint succeeded and commit 20260203180404672 completed successfully. The commit metadata records that the above 7 files from 20260203180402381.parquet were merged into 20260203180404672.parquet. (According to S3 operation logs, 20260203180404672.parquet had delete operations at 18:06 and 18:10.)
- 18:36:56: A new instant 20260203183656117 was started.
- 18:45:49: Flink triggered a checkpoint.
- 19:02:48: The checkpoint succeeded and commit 20260203183656117 completed successfully. The commit metadata records that the 7 files from 20260203160616418.parquet were merged into 20260203183656117.parquet. (It is suspected that Hudi merged against the latest successfully committed base files it could find at that time: since the 7-file version of 20260203180404672 could not be found and commit 20260203180402381 did not succeed, Hudi fell back to merging the files from version 20260203160616418, which resulted in data loss.)

(See the verification sketches at the end of this issue for ways to cross-check the timeline state, the merge lineage recorded in the commit metadata, and the S3 delete operations.)

Attachments (filenames in Chinese):

- [flinkhudi数据丢失描述.md](https://github.com/user-attachments/files/25123651/flinkhudi.md) (flink-hudi data loss: description)
- [flinkhudi数据丢失日志文件相关.md](https://github.com/user-attachments/files/25123650/flinkhudi.md) (flink-hudi data loss: related log files)

### Environment

**Hudi version:** 0.15.0

**Query engine:** Flink

**Relevant configs:**

- table type: COW
- operation: upsert
- index: bucket (128 buckets)
- key generator class: org.apache.hudi.keygen.ComplexAvroKeyGenerator

### Logs and Stack Trace

_No response_
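### Verification sketches

To confirm which of the instants above actually completed, which were rolled back, and which were left requested or inflight, the raw timeline under the table's `.hoodie` directory can be listed directly. The following is a minimal sketch, assuming the `.hoodie` directory has been synced to a local path (the local path is a placeholder) and that timeline files follow the usual `<instant>.<action>[.<state>]` naming:

```python
# Minimal sketch: list Hudi timeline files for the instants mentioned above to see
# which reached a completed state and which rollback metadata exists.
# Assumption: the table's .hoodie directory was synced locally beforehand,
# e.g. `aws s3 sync s3://<bucket>/<table>/.hoodie /tmp/my_table/.hoodie`.
import os
from collections import defaultdict

HOODIE_DIR = "/tmp/my_table/.hoodie"  # placeholder path
INSTANTS_OF_INTEREST = {
    "20260203160616418",
    "20260203180402381",
    "20260203180404672",
    "20260203183656117",
}

files_by_instant = defaultdict(list)
for name in sorted(os.listdir(HOODIE_DIR)):
    instant = name.split(".", 1)[0]
    # Keep files for the instants under investigation, plus any rollback metadata
    # (rollback instants carry their own timestamps and record which instant they rolled back).
    if instant in INSTANTS_OF_INTEREST or ".rollback" in name:
        files_by_instant[instant].append(name)

for instant, names in sorted(files_by_instant.items()):
    print(instant, "->", names)
```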
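The merge-lineage statements at 18:36:55 and 19:02:48 come from the commit metadata. In 0.15.x a completed commit's file, `.hoodie/<instant>.commit`, is JSON, and each write stat records which base-file instant the new file was merged from. The sketch below prints that lineage per file group; the path is a placeholder, and the `partitionToWriteStats`, `fileId`, `path` and `prevCommit` field names are assumed from the standard HoodieCommitMetadata / HoodieWriteStat JSON layout:

```python
# Minimal sketch: parse a completed commit's metadata to see, for each file group,
# which previous base-file version ("prevCommit") the new base file was merged from.
# Assumption: standard JSON layout with a top-level "partitionToWriteStats" map whose
# entries carry "fileId", "path" and "prevCommit". The path below is a placeholder.
import json

COMMIT_FILE = "/tmp/my_table/.hoodie/20260203183656117.commit"  # placeholder path

with open(COMMIT_FILE) as f:
    metadata = json.load(f)

for partition, write_stats in metadata.get("partitionToWriteStats", {}).items():
    for stat in write_stats:
        # For the reported incident, prevCommit would be expected to show
        # 20260203180404672 rather than 20260203160616418 for the affected file groups.
        print(f"{partition}  fileId={stat.get('fileId')}  "
              f"prevCommit={stat.get('prevCommit')}  path={stat.get('path')}")
```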
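The delete operations observed at 18:06 and 18:10 can also be cross-checked against S3 itself. If versioning is enabled on the bucket, listing object versions and delete markers under the affected partition shows when each parquet version disappeared and whether an older version is still recoverable. A minimal boto3 sketch, with bucket and prefix as placeholders:

```python
# Minimal sketch: list S3 object versions and delete markers under a partition prefix
# to see when older parquet versions were deleted.
# Assumptions: boto3 is installed, credentials allow s3:ListBucketVersions, and the
# bucket/prefix values below are placeholders for the real table location.
# Note: without bucket versioning, only current versions are returned and no delete
# markers will appear.
import boto3

BUCKET = "my-hudi-bucket"                             # placeholder
PREFIX = "warehouse/my_table/partition=2026-02-03/"   # placeholder partition path

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_object_versions")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for v in page.get("Versions", []):
        print("VERSION", v["LastModified"], v["Key"], "latest=" + str(v["IsLatest"]))
    for d in page.get("DeleteMarkers", []):
        print("DELETED", d["LastModified"], d["Key"])
```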
