davidzollo opened a new issue, #10926:
URL: https://github.com/apache/seatunnel/issues/10926

   ### Search before asking
   
   - [x] I searched the existing issues and found related restore discussions, 
but none that covers this exact Zeta capability gap.
   - Related but different issues:
     - `#10193`: Flink engine checkpoint/savepoint restore compatibility
     - `#10574`: sink-side data loss during restore after failure
     - `#10899`: MySQL CDC timestamp startup mode cannot recover from 
checkpoints
   
   ### What happened
   
   In a scheduler / orchestrator environment, a long-running Zeta streaming job 
may be **force-stopped from outside the engine** (for example: task kill, Pod 
restart, scheduler-side takeover failure, or other non-graceful interruption).
   
   Even when the job has already completed multiple checkpoints successfully, 
there does not appear to be a clear and supported Zeta-side recovery path to 
**resume the next run from the latest successful checkpoint**.
   
   In practice, operators often have to treat the next start as a fresh 
submission, which weakens the operational value of checkpointing for streaming 
jobs.
   
   ### Why this matters
   
   This is especially painful for CDC-style long-running jobs:
   
   - re-running from scratch can repeat snapshot work
   - resubmitting without a restore contract can create replay / duplicate / 
manual-reconciliation risks
   - external schedulers cannot reliably implement "force stop -> restart from 
latest checkpoint" semantics on top of Zeta today
   
   Checkpointing is already active during runtime, so after a forced stop, 
operators naturally expect to be able to recover from the latest successful 
checkpoint rather than start over.
   
   ### Steps to reproduce
   
   1. Submit a Zeta streaming job with checkpointing enabled.
   2. Wait until several checkpoints complete successfully.
   3. Force-stop the running task from the scheduler / orchestration layer, 
without a graceful savepoint-style stop.
   4. Start the job again.
   5. Observe that there is no clear supported mechanism to bind the new run to 
the latest successful checkpoint of the previous run.
   
   ### Expected behavior
   
   Zeta should provide a clear and supported recovery contract for this 
scenario.
   
   Ideally, after a streaming job has completed checkpoint `N`, and the running 
task is force-stopped externally, the next restore / restart / resubmission 
should be able to:
   
   1. discover the latest successful checkpoint for that job
   2. resume from checkpoint `N` instead of behaving like a fresh submission
   3. make the restore path explicit for sources / transforms / sinks
   4. define fallback behavior when the latest checkpoint is missing, 
incompatible, or already cleaned up
   
   If this is already supported today, then the current documentation does not 
make the operator-facing workflow clear enough, and that workflow should be 
documented explicitly.
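
The contract listed in this section could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Zeta's API: `Checkpoint`, `RestoreDecision`, `plan_restore`, the `state_version` compatibility marker, and the `allow_fresh_start` switch are all hypothetical names, and falling back to an older checkpoint is only one possible policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class RestoreMode(Enum):
    RESTORE = "restore"  # resume from a completed checkpoint
    FRESH = "fresh"      # nothing usable, behave like a new submission


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint metadata record (not a SeaTunnel class)."""
    job_id: str
    checkpoint_id: int
    completed: bool         # False for in-flight or failed checkpoints
    state_version: int      # assumed compatibility marker
    available: bool = True  # False once storage has cleaned it up


@dataclass(frozen=True)
class RestoreDecision:
    mode: RestoreMode
    checkpoint: Optional[Checkpoint]
    reason: str


def plan_restore(
    job_id: str,
    checkpoints: Iterable[Checkpoint],
    supported_version: int,
    allow_fresh_start: bool = False,
) -> RestoreDecision:
    """Pick the newest usable checkpoint for job_id, with explicit fallback."""
    # Step 1: discover completed checkpoints of this job, newest first.
    candidates = sorted(
        (c for c in checkpoints if c.job_id == job_id and c.completed),
        key=lambda c: c.checkpoint_id,
        reverse=True,
    )
    # Steps 2 and 4: skip checkpoints that are cleaned up or incompatible.
    for c in candidates:
        if c.available and c.state_version == supported_version:
            return RestoreDecision(
                RestoreMode.RESTORE, c,
                f"resuming from checkpoint {c.checkpoint_id}",
            )
    # Step 4: a fresh start happens only when the operator opted in.
    if allow_fresh_start:
        return RestoreDecision(
            RestoreMode.FRESH, None,
            "no usable checkpoint; fresh start explicitly allowed",
        )
    raise LookupError(f"no usable checkpoint for job {job_id}")
```

The design choice worth discussing is the last branch: in this sketch a restart never silently degrades into a fresh submission. An external scheduler would either get a concrete checkpoint to bind to, or an error it has to handle.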
   
   ### Actual behavior
   
   From the operator side, the practical behavior looks like a capability gap:
   
   - checkpoints are being completed during runtime
   - but after a forced stop, the next run does not have a clear official 
resume-from-latest-checkpoint path
   - the burden is pushed to the external scheduler / operator to guess how 
recovery should work
   
   ### Relevant evidence
   
   A few implementation details suggest that most of the building blocks 
already exist, but the end-to-end operator-facing resume contract is incomplete 
or unclear:
   
   - the Zeta runtime already has checkpointing / savepoint-related job states
   - source translation modules already contain snapshot / restore logic, such 
as `CoordinatedSource` and `ParallelSource`
   - the missing part seems to be the engine-level / job-lifecycle-level 
recovery path for "force stop -> restart from latest successful checkpoint"
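
To make the gap concrete, here is a toy illustration of why source-level snapshot / restore logic is not sufficient by itself. `ToyCdcSource` and `SourceState` are hypothetical and do not mirror the real `CoordinatedSource` / `ParallelSource` interfaces; the sketch only assumes that a source can export its state and be rebuilt from it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SourceState:
    """State a checkpoint would capture for the toy source."""
    snapshot_done: bool  # True once the initial table snapshot was emitted
    offset: int          # next position to read in the change log


class ToyCdcSource:
    """Hypothetical CDC-like source: one snapshot, then change events."""

    def __init__(
        self,
        snapshot_rows: List[str],
        change_log: List[str],
        restored: Optional[SourceState] = None,
    ) -> None:
        self.snapshot_rows = snapshot_rows
        self.change_log = change_log
        # Without restored state the source behaves like a fresh submission.
        state = restored if restored is not None else SourceState(False, 0)
        self.snapshot_done = state.snapshot_done
        self.offset = state.offset

    def poll(self) -> List[str]:
        """Emit the full snapshot once, then one change event per call."""
        if not self.snapshot_done:
            self.snapshot_done = True
            return list(self.snapshot_rows)
        if self.offset < len(self.change_log):
            event = self.change_log[self.offset]
            self.offset += 1
            return [event]
        return []

    def snapshot_state(self) -> SourceState:
        """Export the state a completed checkpoint would persist."""
        return SourceState(self.snapshot_done, self.offset)
```

The source can resume correctly only if something hands the saved `SourceState` back to it on the next start. If the restarted job constructs the source without `restored`, the first `poll()` repeats the snapshot. That hand-off is the engine-level / job-lifecycle piece this issue asks about.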
   
   ### SeaTunnel Version
   
   Observed from a SeaTunnel-based Zeta deployment in the 2.6.x line.
   
   I am filing this as a Zeta capability request / gap report. If maintainers 
believe this is already supported upstream, please point to the recommended 
recovery workflow.
   
   ### SeaTunnel Config
   
   A representative streaming setup looks like this:
   
   ```conf
   env {
     job.mode = "STREAMING"
     parallelism = 1
     checkpoint.interval = 10000
   }
   
   source {
     SqlServer-CDC {
       startup.mode = "INITIAL"
       database-names = ["demo_db"]
       table-names = ["demo_table"]
       base-url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"
       username = "<user>"
       password = "<password>"
     }
   }
   
   sink {
     Iceberg {
       namespace = demo
       table = "${table_name}"
       catalog_name = default
     }
   }
   ```
   
   The exact connector combination is not the core point here. The request is 
about the Zeta job-lifecycle recovery contract after a forced stop.
   
   ### Running Command
   
   ```shell
   ./bin/seatunnel-cluster.sh -d
   ```
   
   ### Error Exception
   
   ```log
   There may be no single canonical engine exception for this scenario.
   The core problem is the lack of a clear supported restore path from the 
latest successful checkpoint after a forced stop.
   ```
   
   ### Zeta or Flink or Spark Version
   
   Zeta
   
   ### Java or Scala Version
   
   Java
   
   ### Screenshots
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's Code of Conduct
   
   ### Contribution Claim
   
   If this direction makes sense, contributors are welcome to leave a comment 
to claim or discuss the implementation direction.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.