heshshark opened a new issue, #18105:
URL: https://github.com/apache/hudi/issues/18105

   ### Bug Description
   
   **What happened:**
   When writing a Hudi table with Flink, Parquet files generated by a previous successful commit are no longer present (root cause not yet identified).
   
   As a result, during the next commit, Hudi merges against an older Parquet file from an earlier commit instead of the immediately previous version.
   
   This causes data from the missing commit to be lost.
   
   - A commit completes successfully
   - Parquet files for this commit are expected to exist
   - Before the next commit, these Parquet files are missing
   - The next commit merges against an earlier Parquet version
   
   **What you expected:**
   For each commit written by Flink:
   
   - Parquet files produced by the commit should exist until they are 
explicitly replaced or cleaned
   
   - The next commit should merge based on the immediately previous commit’s 
Parquet files
   
   - No earlier Parquet version should be used for merge unless explicitly 
intended
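
   For reference, these expectations can be checked against the commit metadata Hudi keeps on the timeline. The following is a minimal sketch (not the procedure used here) that prints, for each completed commit, which previous file version every write merged against, by reading the JSON `.commit` files under `.hoodie/`. It assumes a Hudi 0.x COW table; the base-path argument is hypothetical, and the JSON field names (`partitionToWriteStats`, `prevCommit`, `path`) are assumptions based on how Hudi 0.x serializes commit metadata.

```java
// Sketch: for each completed commit, print which previous file version each write
// merged against, by parsing the JSON commit metadata under <basePath>/.hoodie/.
// Assumes a Hudi 0.x COW table; args[0] (the table base path) is hypothetical.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintCommitLineage {
  public static void main(String[] args) throws Exception {
    Path hoodieDir = new Path(args[0], ".hoodie");          // e.g. s3a://bucket/table (hypothetical)
    FileSystem fs = hoodieDir.getFileSystem(new Configuration());
    ObjectMapper mapper = new ObjectMapper();
    for (FileStatus status : fs.listStatus(hoodieDir)) {
      String name = status.getPath().getName();
      if (!name.endsWith(".commit")) {                      // completed COW commits only
        continue;
      }
      String instant = name.substring(0, name.indexOf('.'));
      JsonNode meta = mapper.readTree(fs.open(status.getPath()));
      // partitionToWriteStats: partition -> list of write stats, each recording the
      // base file it produced ("path") and the file version it merged ("prevCommit")
      meta.path("partitionToWriteStats").fields().forEachRemaining(entry ->
          entry.getValue().forEach(stat ->
              System.out.printf("%s  %s  prevCommit=%s  ->  %s%n",
                  instant, entry.getKey(),
                  stat.path("prevCommit").asText(),
                  stat.path("path").asText())));
    }
  }
}
```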
   
   **Steps to reproduce:**
   The exact steps to reproduce are unclear. The timeline of events was as follows:
   16:58 ~ 17:02
   The job entered a restarting state.
   
   17:08
   Determined that the job could not recover quickly and latency was increasing.
   Increased write task parallelism from 256 to 384 and restarted the job from 
checkpoint.
   
   17:19
   Observed JobManager running out of memory with multiple full GCs.
   Increased JobManager memory from 4 GB to 8 GB and restarted the job again.
   
   17:30
   Observed the job stuck in the write phase while consuming very little data.
   Suspected this might be caused by parallelism changes, so reduced 
parallelism from 384 to 256 and restarted the job from checkpoint again.
   
   17:44
   TaskManager OOM occurred.
   
   17:45
   Increased TaskManager memory from 8 GB to 12 GB, adjusted parallelism from 
256 to 384, and restarted the job without checkpoint, consuming from 
group-offsets.
   
   18:04:02
   A new instant 20260203180402381 was started.
   
   18:04:03
   A rollback of 20260203180402381 occurred, but a version with 7 Parquet files for instant 20260203180402381 was generated.
   
   18:04:04
   A new instant 20260203180404672 was started.
   
   18:10:54
   Flink triggered a checkpoint.
   
   18:36:55
   The checkpoint succeeded, and commit 20260203180404672 completed 
successfully.
   The commit metadata records that the above 7 Parquet files of version 20260203180402381 were merged into new files of version 20260203180404672.
   (According to S3 operation logs, the 20260203180404672 Parquet files had delete operations at 18:06 and 18:10.)
   
   18:36:56
   A new instant 20260203183656117 was started.
   
   18:45:49
   Flink triggered a checkpoint.
   
   19:02:48
   The checkpoint succeeded, and commit 20260203183656117 completed 
successfully.
   The commit metadata records that, for the same 7 file groups, the base files of version 20260203160616418 were merged into new files of version 20260203183656117.
   (It is suspected that Hudi merged against the latest successfully committed base files at that time: since the 7-file version of 20260203180404672 could no longer be found and commit 20260203180402381 had not succeeded, Hudi fell back to merging the files of version 20260203160616418, which resulted in data loss.)
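
   To check which base-file versions of the affected file groups were still physically present at each point, the partition can be listed directly from storage. The following rough sketch (the partition path is hypothetical) relies on the standard Hudi base-file naming `<fileId>_<writeToken>_<instantTime>.parquet` and groups the surviving Parquet files by file group:

```java
// Sketch: list surviving base-file versions per file group in one Hudi partition,
// to see which instants' Parquet files still physically exist.
// args[0] is a hypothetical partition path, e.g. s3a://bucket/table/partition=2026-02-03
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFileGroupVersions {
  public static void main(String[] args) throws Exception {
    Path partition = new Path(args[0]);
    FileSystem fs = partition.getFileSystem(new Configuration());
    // fileId -> (instant time -> file name), based on <fileId>_<writeToken>_<instantTime>.parquet
    Map<String, TreeMap<String, String>> groups = new TreeMap<>();
    for (FileStatus status : fs.listStatus(partition)) {
      String name = status.getPath().getName();
      if (!name.endsWith(".parquet")) {
        continue;                                   // skip partition metadata etc.
      }
      String fileId = name.substring(0, name.indexOf('_'));
      String instant = name.substring(name.lastIndexOf('_') + 1, name.lastIndexOf('.'));
      groups.computeIfAbsent(fileId, k -> new TreeMap<>()).put(instant, name);
    }
    // A healthy COW file group should have a version for the latest completed commit;
    // the symptom here would be file groups whose newest version jumps back to an old instant.
    groups.forEach((fileId, versions) ->
        System.out.println(fileId + " -> " + versions.keySet()));
  }
}
```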
   
   
   
[flinkhudi data loss description.md](https://github.com/user-attachments/files/25123651/flinkhudi.md)
   
[flinkhudi data loss related log files.md](https://github.com/user-attachments/files/25123650/flinkhudi.md)
   
   ### Environment
   
   **Hudi version:** 0.15.0
   
   **Query engine:** Flink
   
   **Relevant configs:**
   
   - table type: COW (copy-on-write)
   - operation: upsert
   - index: bucket (128 buckets)
   - key generator class: org.apache.hudi.keygen.ComplexAvroKeyGenerator
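
   For context, a Flink SQL table definition roughly matching these settings could look like the sketch below. The column names, primary key, table path, and some option keys are assumptions and may not match the actual job; they are shown only to make the configuration concrete.

```java
// Sketch: Flink SQL DDL approximating the reported Hudi sink configuration
// (COW, upsert, bucket index with 128 buckets, ComplexAvroKeyGenerator).
// Schema, path, and exact option keys are assumptions.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiSinkDdlSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());
    tEnv.executeSql(
        "CREATE TABLE hudi_sink (\n"
      + "  key1 STRING,\n"
      + "  key2 STRING,\n"
      + "  payload STRING,\n"
      + "  ts TIMESTAMP(3),\n"
      + "  PRIMARY KEY (key1, key2) NOT ENFORCED\n"
      + ") WITH (\n"
      + "  'connector' = 'hudi',\n"
      + "  'path' = 's3a://bucket/path/to/table',\n"    // hypothetical table path
      + "  'table.type' = 'COPY_ON_WRITE',\n"
      + "  'write.operation' = 'upsert',\n"
      + "  'index.type' = 'BUCKET',\n"
      + "  'hoodie.bucket.index.num.buckets' = '128',\n"
      + "  'hoodie.datasource.write.keygenerator.class' = "
      + "'org.apache.hudi.keygen.ComplexAvroKeyGenerator'\n"
      + ")");
  }
}
```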
   
   
   
   ### Logs and Stack Trace
   
   _No response_

