davidzollo opened a new issue, #10926:
URL: https://github.com/apache/seatunnel/issues/10926
### Search before asking
- [x] I have searched the existing issues and found related restore
discussions, but not this exact Zeta capability gap.
- Related but different issues:
  - `#10193`: Flink engine checkpoint/savepoint restore compatibility
  - `#10574`: sink-side data loss during restore after failure
  - `#10899`: MySQL CDC timestamp startup mode cannot recover from
    checkpoints
### What happened
In a scheduler / orchestrator environment, a long-running Zeta streaming job
may be **force-stopped from outside the engine** (for example: task kill, Pod
restart, scheduler-side takeover failure, or other non-graceful interruption).
Even when the job has already completed multiple checkpoints successfully,
there does not appear to be a clear and supported Zeta-side recovery path to
**resume the next run from the latest successful checkpoint**.
In practice, operators often have to treat the next start as a fresh
submission, which weakens the operational value of checkpointing for streaming
jobs.
### Why this matters
This is especially painful for CDC-style long-running jobs:
- re-running from scratch can repeat snapshot work
- resubmitting without a restore contract can create replay / duplicate /
manual-reconciliation risks
- external schedulers cannot reliably implement "force stop -> restart from
latest checkpoint" semantics on top of Zeta today
Checkpointing is already active during runtime, so after a forced stop,
operators naturally expect to be able to recover from the latest successful
checkpoint rather than start over.
### Steps to reproduce
1. Submit a Zeta streaming job with checkpointing enabled.
2. Wait until several checkpoints complete successfully.
3. Force-stop the running task from the scheduler / orchestration layer,
without a graceful savepoint-style stop.
4. Start the job again.
5. Observe that there is no clear supported mechanism to bind the new run to
the latest successful checkpoint of the previous run.
### Expected behavior
Zeta should provide a clear and supported recovery contract for this
scenario.
Ideally, after a streaming job has completed checkpoint `N`, and the running
task is force-stopped externally, the next restore / restart / resubmission
should be able to:
1. discover the latest successful checkpoint for that job
2. resume from checkpoint `N` instead of behaving like a fresh submission
3. make the restore path explicit for sources / transforms / sinks
4. define fallback behavior when the latest checkpoint is missing,
incompatible, or already cleaned up
If this is already supported today, then the current documentation does not
make the operator-facing workflow clear enough, and the workflow should be
documented explicitly.
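
To make the four points above concrete, here is a minimal sketch of the decision a restore path would have to make. It is not Zeta code: `CheckpointRecord`, `RestoreDecision`, `resolve_restore`, the `state_version` compatibility marker, and the `allow_fresh_start` flag are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class RestoreMode(Enum):
    FROM_CHECKPOINT = "from_checkpoint"
    FRESH_START = "fresh_start"
    FAIL = "fail"


@dataclass(frozen=True)
class CheckpointRecord:
    """Hypothetical metadata for one checkpoint of one job."""
    job_name: str
    checkpoint_id: int
    completed: bool      # False if the checkpoint was still in flight at kill time
    state_version: int   # hypothetical compatibility marker


@dataclass(frozen=True)
class RestoreDecision:
    mode: RestoreMode
    checkpoint: Optional[CheckpointRecord] = None
    reason: str = ""


def resolve_restore(
    job_name: str,
    records: Sequence[CheckpointRecord],
    supported_state_version: int,
    allow_fresh_start: bool = False,
) -> RestoreDecision:
    """Pick the latest completed checkpoint, or apply an explicit fallback."""
    # 1. Discover completed checkpoints that belong to this job.
    candidates = [r for r in records if r.job_name == job_name and r.completed]
    if not candidates:
        reason = "no completed checkpoint found (missing or cleaned up)"
    else:
        latest = max(candidates, key=lambda r: r.checkpoint_id)
        # 2. Resume from checkpoint N when its state is compatible.
        if latest.state_version == supported_state_version:
            return RestoreDecision(
                RestoreMode.FROM_CHECKPOINT, latest, "latest completed checkpoint"
            )
        reason = f"checkpoint {latest.checkpoint_id} has an incompatible state version"
    # 4. The fallback is explicit: never silently behave like a fresh submission.
    mode = RestoreMode.FRESH_START if allow_fresh_start else RestoreMode.FAIL
    return RestoreDecision(mode, None, reason)
```

In this sketch the default fallback is to fail rather than start fresh, so that a scheduler has to opt in to a restart from scratch. Point 3 (making the restore path explicit for sources / transforms / sinks) is outside what this sketch covers.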
### Actual behavior
From the operator side, the practical behavior looks like a capability gap:
- checkpoints are being completed during runtime
- but after a forced stop, the next run does not have a clear official
resume-from-latest-checkpoint path
- the burden is pushed to the external scheduler / operator to guess how
recovery should work
### Relevant evidence
A few implementation details suggest that most of the building blocks
already exist, but the end-to-end operator-facing resume contract is incomplete
or unclear:
- Zeta runtime already has checkpointing / savepoint-related job states
- source translation modules already contain snapshot / restore logic, such
as `CoordinatedSource` and `ParallelSource`
- the missing part seems to be the engine-level / job-lifecycle-level
recovery path for "force stop -> restart from latest successful checkpoint"
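
As a simplified illustration of what source-level snapshot / restore logic provides, and why the engine-level binding is the missing piece, consider the sketch below. It does not use the SeaTunnel API: `OffsetSource` and `SourceState` are hypothetical, and the offset-based state is an assumption chosen for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SourceState:
    """Hypothetical checkpointed state: index of the next record to emit."""
    next_offset: int


class OffsetSource:
    """Hypothetical source that reads a fixed list of records in order."""

    def __init__(self, records: List[str], state: Optional[SourceState] = None) -> None:
        self._records = list(records)
        # Without restored state the source starts from the beginning.
        self._offset = state.next_offset if state is not None else 0

    def poll(self) -> Optional[str]:
        """Return the next record, or None when the input is exhausted."""
        if self._offset >= len(self._records):
            return None
        record = self._records[self._offset]
        self._offset += 1
        return record

    def snapshot_state(self) -> SourceState:
        """Capture the current read position for a checkpoint."""
        return SourceState(self._offset)
```

A short usage example under the same assumptions:

```python
rows = ["a", "b", "c", "d", "e"]
first_run = OffsetSource(rows)
for _ in range(3):
    first_run.poll()                      # emits a, b, c
checkpoint = first_run.snapshot_state()   # checkpoint N completes here
first_run.poll()                          # emits d, then the task is force-stopped

restored = OffsetSource(rows, checkpoint) # next run bound to checkpoint N
fresh = OffsetSource(rows)                # next run treated as a fresh submission
```

The restored run continues at `d` (replaying the record emitted after checkpoint `N`), while the fresh run starts over at `a`. The source can do either; which one happens depends on whether the job lifecycle hands the checkpointed state to the new run.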
### SeaTunnel Version
Observed from a SeaTunnel-based Zeta deployment in the 2.6.x line.
I am filing this as a Zeta capability request / gap report. If maintainers
believe this is already supported upstream, please point to the recommended
recovery workflow.
### SeaTunnel Config
A representative streaming setup looks like this:
```conf
env {
  job.mode = "STREAMING"
  parallelism = 1
  checkpoint.interval = 10000
}
source {
  SqlServer-CDC {
    startup.mode = "INITIAL"
    database-names = ["demo_db"]
    table-names = ["demo_table"]
    base-url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"
    username = "<user>"
    password = "<password>"
  }
}
sink {
  Iceberg {
    namespace = demo
    table = "${table_name}"
    catalog_name = default
  }
}
```
The exact connector combination should not be the core point here. The
request is about the Zeta job-lifecycle recovery contract after a forced stop.
### Running Command
```shell
./bin/seatunnel-cluster.sh -d
```
### Error Exception
```log
There may be no single canonical engine exception for this scenario.
The core problem is the lack of a clear supported restore path from the
latest successful checkpoint after a forced stop.
```
### Zeta or Flink or Spark Version
Zeta
### Java or Scala Version
Java
### Screenshots
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit PR!
### Code of Conduct
- [x] I agree to follow this project's Code of Conduct
### Contribution Claim
If this direction makes sense, contributors are welcome to leave a comment
to claim or discuss the implementation direction.