litiliu opened a new issue, #2648:
URL: https://github.com/apache/fluss/issues/2648

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/fluss/issues) and 
found nothing similar.
   
   
   ### Fluss version
   
   0.8.0 (latest release)
   
   ### Please describe the bug 🐞
   
   Summary
   Fluss tiering jobs perform a “missing snapshot” check before committing to 
the lake (to avoid duplicates when Fluss missed a lake snapshot). However, if 
the heartbeat channel between the Flink tiering job and the Coordinator is 
unstable, the Coordinator can time out the job and reassign the same table to a 
new job. If both jobs pass the missing‑snapshot check and then commit 
concurrently, duplicate lake snapshots/data can be produced (especially for 
append‑only / non‑PK tables), because commit is not fenced by tieringEpoch.
   
   Environment
   
   - Fluss version: 0.10-SNAPSHOT
   - Lake storage: Iceberg/Paimon/Lance
   - Flink tiering service: fluss-flink-tiering
   - Network: heartbeat path flaky/partitioned; lake storage reachable
   
   Expected
   Once a table round is assigned, only one tiering job should be able to 
commit its lake snapshot. If a job is timed out and another job takes over, 
stale commits should be rejected.
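
One way to picture the expected behaviour is an epoch-fenced commit. The sketch below is only an illustration under stated assumptions: `FencedLake`, `fence`, and `commit` are hypothetical names for an in-memory stand-in, not Fluss, Iceberg, Paimon, or Lance APIs, and a real lake would need to persist the epoch atomically with the snapshot (for example in snapshot properties).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a lake table. Not a real Fluss or lake API.
class FencedLake {
    // Highest tiering epoch that has been registered or committed so far.
    private long highestEpochSeen = 0;
    private final List<String> snapshots = new ArrayList<>();

    // A new owner registers its epoch before reading lake state,
    // so that any older owner is fenced from this point on.
    synchronized void fence(long tieringEpoch) {
        highestEpochSeen = Math.max(highestEpochSeen, tieringEpoch);
    }

    // Epoch check and commit happen under one lock, so they are atomic here.
    // Returns false when the caller's epoch is older than the highest one seen.
    synchronized boolean commit(long tieringEpoch, String snapshot) {
        if (tieringEpoch < highestEpochSeen) {
            return false; // stale owner: reject instead of writing a duplicate
        }
        highestEpochSeen = tieringEpoch;
        snapshots.add(snapshot);
        return true;
    }

    // Defensive copy of the committed snapshots.
    synchronized List<String> snapshots() {
        return new ArrayList<>(snapshots);
    }
}
```

With this shape, a commit from Job A (epoch=1) that arrives after Job B has registered epoch=2 returns `false`, while a commit from Job A that lands before the fence is already visible when Job B runs its own missing-snapshot check.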
   
   Actual
   Two tiering jobs can both commit a snapshot for the same round if they both 
pass the “missing snapshot” check before either commit is visible.
   
   Why this happens (race window)
   
   1. Job A is assigned table T (epoch=1) and starts processing.
   2. The heartbeat to the Coordinator is lost (network issue).
   3. The Coordinator times out Job A and reassigns table T to Job B (epoch=2).
   4. Both Job A and Job B perform the missing-snapshot check. Each sees
      “no lake snapshot missing from Fluss” because neither has committed yet.
   5. Both proceed to commit to the lake, so duplicate snapshots/data are
      possible.
   
   
   ```mermaid
   sequenceDiagram
       participant C as Fluss Coordinator
       participant A as Tiering Job A
       participant B as Tiering Job B
       participant L as Lake
   
       C->>A: assign T, epoch=1
       A-->>C: heartbeat ok
   
       Note over A,C: heartbeat link lost
       C->>C: timeout (2 min)
       C->>B: assign T, epoch=2
   
       A->>L: missing-snapshot check (no missing)
       B->>L: missing-snapshot check (no missing)
   
       A->>L: commit snapshot
       B->>L: commit snapshot
       Note over L: duplicate snapshots/data possible
   
   ```
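
The interleaving in the diagram can be replayed deterministically with a small simulation. This is a simplified model, not the actual tiering code: `UnfencedLake`, `RaceDemo`, and the way the missing-snapshot check is expressed (comparing the lake's latest snapshot id with the id Fluss knows about) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lake with no epoch fencing: every commit is accepted.
class UnfencedLake {
    private final List<String> snapshots = new ArrayList<>();

    // Snapshot ids are modelled as 1..n, with 0 meaning "no snapshot yet".
    long latestSnapshotId() {
        return snapshots.size();
    }

    void commit(String payload) {
        snapshots.add(payload);
    }

    List<String> snapshots() {
        return new ArrayList<>(snapshots);
    }
}

class RaceDemo {
    // Simplified missing-snapshot check: passes when the lake holds no
    // snapshot beyond the one Fluss already knows about.
    static boolean noMissingSnapshot(UnfencedLake lake, long flussKnownSnapshotId) {
        return lake.latestSnapshotId() == flussKnownSnapshotId;
    }

    // Replays the sequence: both jobs check first, then both commit.
    static List<String> run() {
        UnfencedLake lake = new UnfencedLake();
        long flussKnownSnapshotId = 0;
        String roundData = "table T, offsets [0,100)";

        // Job A (epoch=1) and Job B (epoch=2) both run the check
        // before either commit is visible, so both pass.
        boolean jobAPasses = noMissingSnapshot(lake, flussKnownSnapshotId);
        boolean jobBPasses = noMissingSnapshot(lake, flussKnownSnapshotId);

        // Nothing rejects the stale owner, so both commits land.
        if (jobAPasses) {
            lake.commit(roundData);
        }
        if (jobBPasses) {
            lake.commit(roundData);
        }
        return lake.snapshots();
    }
}
```

In this model `RaceDemo.run()` ends with two snapshots carrying the same round data, which is the duplicate described above. Running the check again after Job A's commit would make Job B's check fail, which is why the window only exists when both checks precede both commits.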
   
   
   Impact
   
   - Duplicate lake data/snapshots under heartbeat‑partition scenarios.
   - Exactly‑once semantics can be violated for non‑PK/append‑only tables.
   
   ### Solution
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!

