heshshark opened a new issue, #18105:
URL: https://github.com/apache/hudi/issues/18105

   ### Bug Description
   
   **What happened:**
   When writing a Hudi table with Flink, Parquet files generated by a previous successful commit are no longer present (root cause not yet identified).
   
   As a result, during the next commit, Hudi merges against an older Parquet file from an earlier commit instead of the immediately previous version.
   
   This causes data from the missing commit to be lost.
   
   - A commit completes successfully
   - Parquet files for this commit are expected to exist
   - Before the next commit, these Parquet files are missing
   - The next commit merges against an earlier Parquet version
   
   **What you expected:**
   For each commit written by Flink:
   
   - Parquet files produced by the commit should exist until they are 
explicitly replaced or cleaned
   
   - The next commit should merge based on the immediately previous commit’s 
Parquet files
   
   - No earlier Parquet version should be used for merge unless explicitly 
intended
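
   For reference, these expectations can be checked against the commit metadata Hudi keeps on the timeline. The following is a minimal sketch (not the procedure used here) that prints, for each completed commit, which previous file version every write merged against, by reading the JSON `.commit` files under `.hoodie/`. It assumes a Hudi 0.x COW table; the base-path argument is hypothetical, and the JSON field names (`partitionToWriteStats`, `prevCommit`, `path`) are assumptions based on how Hudi 0.x serializes commit metadata.

```java
// Sketch: for each completed commit, print which previous file version each write
// merged against, by parsing the JSON commit metadata under <basePath>/.hoodie/.
// Assumes a Hudi 0.x COW table; args[0] (the table base path) is hypothetical.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintCommitLineage {
  public static void main(String[] args) throws Exception {
    Path hoodieDir = new Path(args[0], ".hoodie");          // e.g. s3a://bucket/table (hypothetical)
    FileSystem fs = hoodieDir.getFileSystem(new Configuration());
    ObjectMapper mapper = new ObjectMapper();
    for (FileStatus status : fs.listStatus(hoodieDir)) {
      String name = status.getPath().getName();
      if (!name.endsWith(".commit")) {                      // completed COW commits only
        continue;
      }
      String instant = name.substring(0, name.indexOf('.'));
      JsonNode meta = mapper.readTree(fs.open(status.getPath()));
      // partitionToWriteStats: partition -> list of write stats, each recording the
      // base file it produced ("path") and the file version it merged ("prevCommit")
      meta.path("partitionToWriteStats").fields().forEachRemaining(entry ->
          entry.getValue().forEach(stat ->
              System.out.printf("%s  %s  prevCommit=%s  ->  %s%n",
                  instant, entry.getKey(),
                  stat.path("prevCommit").asText(),
                  stat.path("path").asText())));
    }
  }
}
```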
   
   **Steps to reproduce:**
   The exact steps to reproduce are unclear. The timeline of events was as follows:
   16:58 ~ 17:02
   The job entered a restarting state.
   
   17:08
   Determined that the job could not recover quickly and latency was increasing.
   Increased write task parallelism from 256 to 384 and restarted the job from 
checkpoint.
   
   17:19
   Observed JobManager running out of memory with multiple full GCs.
   Increased JobManager memory from 4 GB to 8 GB and restarted the job again.
   
   17:30
   Observed the job stuck in the write phase while consuming very little data.
   Suspected this might be caused by parallelism changes, so reduced 
parallelism from 384 to 256 and restarted the job from checkpoint again.
   
   17:44
   TaskManager OOM occurred.
   
   17:45
   Increased TaskManager memory from 8 GB to 12 GB, adjusted parallelism from 
256 to 384, and restarted the job without checkpoint, consuming from 
group-offsets.
   
   18:04:02
   A new instant 20260203180402381 was started.
   
   18:04:03
   A rollback of 20260203180402381 occurred, but a version with 7 Parquet files for instant 20260203180402381 was generated.
   
   18:04:04
   A new instant 20260203180404672 was started.
   
   18:10:54
   Flink triggered a checkpoint.
   
   18:36:55
   The checkpoint succeeded, and commit 20260203180404672 completed 
successfully.
   The commit metadata records that the above 7 Parquet files of version 20260203180402381 were merged into new files of version 20260203180404672.
   (According to S3 operation logs, the 20260203180404672 Parquet files had delete operations at 18:06 and 18:10.)
   
   18:36:56
   A new instant 20260203183656117 was started.
   
   18:45:49
   Flink triggered a checkpoint.
   
   19:02:48
   The checkpoint succeeded, and commit 20260203183656117 completed 
successfully.
   The commit metadata records that, for the same 7 file groups, the base files of version 20260203160616418 were merged into new files of version 20260203183656117.
   (It is suspected that Hudi merged against the latest successfully committed base files at that time: since the 7-file version of 20260203180404672 could no longer be found and commit 20260203180402381 had not succeeded, Hudi fell back to merging the files of version 20260203160616418, which resulted in data loss.)
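
   To check which base-file versions of the affected file groups were still physically present at each point, the partition can be listed directly from storage. The following rough sketch (the partition path is hypothetical) relies on the standard Hudi base-file naming `<fileId>_<writeToken>_<instantTime>.parquet` and groups the surviving Parquet files by file group:

```java
// Sketch: list surviving base-file versions per file group in one Hudi partition,
// to see which instants' Parquet files still physically exist.
// args[0] is a hypothetical partition path, e.g. s3a://bucket/table/partition=2026-02-03
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFileGroupVersions {
  public static void main(String[] args) throws Exception {
    Path partition = new Path(args[0]);
    FileSystem fs = partition.getFileSystem(new Configuration());
    // fileId -> (instant time -> file name), based on <fileId>_<writeToken>_<instantTime>.parquet
    Map<String, TreeMap<String, String>> groups = new TreeMap<>();
    for (FileStatus status : fs.listStatus(partition)) {
      String name = status.getPath().getName();
      if (!name.endsWith(".parquet")) {
        continue;                                   // skip partition metadata etc.
      }
      String fileId = name.substring(0, name.indexOf('_'));
      String instant = name.substring(name.lastIndexOf('_') + 1, name.lastIndexOf('.'));
      groups.computeIfAbsent(fileId, k -> new TreeMap<>()).put(instant, name);
    }
    // A healthy COW file group should have a version for the latest completed commit;
    // the symptom here would be file groups whose newest version jumps back to an old instant.
    groups.forEach((fileId, versions) ->
        System.out.println(fileId + " -> " + versions.keySet()));
  }
}
```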
   
   
   
[flinkhudi data loss description.md](https://github.com/user-attachments/files/25123651/flinkhudi.md)
   
[flinkhudi data loss related log files.md](https://github.com/user-attachments/files/25123650/flinkhudi.md)
   
   ### Environment
   
   **Hudi version:** 0.15.0
   
   **Query engine:** Flink
   
   **Relevant configs:**
   
   - table type: COW (copy-on-write)
   - operation: upsert
   - index: bucket (128 buckets)
   - key generator class: org.apache.hudi.keygen.ComplexAvroKeyGenerator
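
   For context, a Flink SQL table definition roughly matching these settings could look like the sketch below. The column names, primary key, table path, and some option keys are assumptions and may not match the actual job; they are shown only to make the configuration concrete.

```java
// Sketch: Flink SQL DDL approximating the reported Hudi sink configuration
// (COW, upsert, bucket index with 128 buckets, ComplexAvroKeyGenerator).
// Schema, path, and exact option keys are assumptions.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiSinkDdlSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());
    tEnv.executeSql(
        "CREATE TABLE hudi_sink (\n"
      + "  key1 STRING,\n"
      + "  key2 STRING,\n"
      + "  payload STRING,\n"
      + "  ts TIMESTAMP(3),\n"
      + "  PRIMARY KEY (key1, key2) NOT ENFORCED\n"
      + ") WITH (\n"
      + "  'connector' = 'hudi',\n"
      + "  'path' = 's3a://bucket/path/to/table',\n"    // hypothetical table path
      + "  'table.type' = 'COPY_ON_WRITE',\n"
      + "  'write.operation' = 'upsert',\n"
      + "  'index.type' = 'BUCKET',\n"
      + "  'hoodie.bucket.index.num.buckets' = '128',\n"
      + "  'hoodie.datasource.write.keygenerator.class' = "
      + "'org.apache.hudi.keygen.ComplexAvroKeyGenerator'\n"
      + ")");
  }
}
```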
   
   
   
   ### Logs and Stack Trace
   
   _No response_

