litiliu opened a new issue, #2648: URL: https://github.com/apache/fluss/issues/2648
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/fluss/issues) and found nothing similar.

### Fluss version

0.8.0 (latest release)

### Please describe the bug 🐞

Summary

Fluss tiering jobs perform a "missing snapshot" check before committing to the lake (to avoid duplicates when Fluss missed a lake snapshot). However, if the heartbeat channel between the Flink tiering job and the Coordinator is unstable, the Coordinator can time out the job and reassign the same table to a new job. If both jobs pass the missing-snapshot check and then commit concurrently, duplicate lake snapshots/data can be produced (especially for append-only / non-PK tables), because commit is not fenced by tieringEpoch.

Environment

- Fluss version: 0.10-SNAPSHOT
- Lake storage: Iceberg/Paimon/Lance
- Flink tiering service: fluss-flink-tiering
- Network: heartbeat path flaky/partitioned; lake storage reachable

Expected

Once a table round is assigned, only one tiering job should be able to commit its lake snapshot. If a job is timed out and another job takes over, stale commits should be rejected.

Actual

Two tiering jobs can both commit a snapshot for the same round if they both pass the "missing snapshot" check before either commit is visible.

Why this happens (race window)

1. Job A is assigned table T (epoch=1), starts processing.
2. Heartbeat to Coordinator is lost (network issue).
3. Coordinator times out Job A and reassigns table T to Job B (epoch=2).
4. Both Job A and Job B perform the missing snapshot check: each sees "no lake snapshot missing from Fluss" (because neither has committed yet).
5. Both proceed to commit to lake → duplicate snapshots/data possible.
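To make the race window concrete, here is a minimal, self-contained toy model of the interleaving in the steps above. It is only a sketch under stated assumptions, not Fluss code: `Lake`, `register_epoch`, `has_snapshot`, `commit`, and `simulate` are hypothetical names, and the "fenced" variant is just one possible way to reject stale commits by epoch. Where such a fence would actually live (Coordinator, lake commit metadata, or elsewhere) is not specified in this report.

```python
class Lake:
    """Toy stand-in for the lake; NOT a Fluss, Iceberg, or Paimon API."""

    def __init__(self, fenced):
        self.fenced = fenced
        self.snapshots = []      # committed (table, round_id, epoch) tuples
        self.current_epoch = {}  # table -> highest epoch registered so far

    def register_epoch(self, table, epoch):
        # Hypothetical: called whenever the table is (re)assigned to a job.
        self.current_epoch[table] = max(epoch, self.current_epoch.get(table, 0))

    def has_snapshot(self, table, round_id):
        return any((t, r) == (table, round_id) for t, r, _ in self.snapshots)

    def commit(self, table, round_id, epoch):
        # With fencing enabled, a commit from an older epoch is rejected.
        if self.fenced and epoch < self.current_epoch.get(table, 0):
            return False
        self.snapshots.append((table, round_id, epoch))
        return True


def simulate(fenced):
    lake = Lake(fenced)
    table, round_id = "T", 1
    lake.register_epoch(table, 1)  # T assigned to Job A, epoch=1
    lake.register_epoch(table, 2)  # Job A timed out; T reassigned to Job B, epoch=2
    # Both jobs run the missing-snapshot check before either commit is visible.
    a_passes = not lake.has_snapshot(table, round_id)
    b_passes = not lake.has_snapshot(table, round_id)
    if a_passes:
        lake.commit(table, round_id, epoch=1)  # stale job
    if b_passes:
        lake.commit(table, round_id, epoch=2)  # current job
    return lake.snapshots


print(simulate(fenced=False))  # both commits land for the same round
print(simulate(fenced=True))   # only the epoch=2 commit lands
```

In this model the check alone cannot prevent the duplicate, because it only observes commits that are already visible. The epoch comparison at commit time is what rejects Job A's stale commit.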
```mermaid
sequenceDiagram
    participant C as Fluss Coordinator
    participant A as Tiering Job A
    participant B as Tiering Job B
    participant L as Lake
    C->>A: assign T, epoch=1
    A-->>C: heartbeat ok
    Note over A,C: heartbeat link lost
    C->>C: timeout (2 min)
    C->>B: assign T, epoch=2
    A->>L: missing-snapshot check (no missing)
    B->>L: missing-snapshot check (no missing)
    A->>L: commit snapshot
    B->>L: commit snapshot
    Note over L: duplicate snapshots/data possible
```

Impact

- Duplicate lake data/snapshots under heartbeat-partition scenarios.
- Exactly-once semantics can be violated for non-PK/append-only tables.

### Solution

_No response_

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
