lrsb opened a new pull request, #1135:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1135
## What is the purpose of the change
This pull request guards against orphaned/duplicate session jobs. FLINK-38858
already prevents duplicates when the recorded `JobID` survives a failed
submission (it reuses the `JobID`, and Flink rejects a same-`JobID` resubmit).
A residual window remains where the operator generates a **new** `JobID` while
a job from a previous submission is still running on the session cluster - e.g.
the submit succeeded on the cluster but the `JobID` was never durably
persisted, or the `FlinkSessionJob` CR was deleted and recreated against the
long-lived session cluster. The old job is then left orphaned, and two jobs run
concurrently against the same sources/sinks, breaking exactly-once semantics.
This change adds an opt-in safeguard: before a fresh submission, the operator
cancels any matching job already running on the session cluster.
## Brief change log
- Added config option
  `kubernetes.operator.session-job.cancel-orphaned-on-submit`
  (Boolean, default `false`)
- Default `pipeline.name` to the namespace-qualified `FlinkSessionJob` resource
  name so a session job has a stable identity on the cluster
- `SessionJobReconciler.deploy` lists non-terminal jobs on the session cluster
  and, when the option is enabled and a new `JobID` is being generated (not
  reusing an existing one), cancels matching jobs and waits for them to reach a
  terminal state before submitting
- Cancelled `JobID`s are logged at WARN
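
The selection step in the change log above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the code in this PR: the `ClusterJob` type, the method names, and the `.` separator in the default pipeline name are hypothetical, and matching on the job name is an assumption (the description only says "matching jobs"). The terminal set mirrors Flink's globally terminal job statuses; the cancel call and the wait for a terminal state are left out.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class OrphanGuardSketch {

    /** Simplified stand-in for a job listed on the session cluster (hypothetical type). */
    public static final class ClusterJob {
        public final String jobId;
        public final String name;
        public final String state;

        public ClusterJob(String jobId, String name, String state) {
            this.jobId = jobId;
            this.name = name;
            this.state = state;
        }
    }

    // Statuses treated as terminal in this sketch (Flink's globally terminal ones).
    static final Set<String> TERMINAL = Set.of("FINISHED", "CANCELED", "FAILED");

    /** Namespace-qualified default name; the "." separator is an assumption. */
    public static String defaultPipelineName(String namespace, String resourceName) {
        return namespace + "." + resourceName;
    }

    /**
     * Returns the IDs of jobs that would be cancelled before a fresh submission.
     * Nothing is selected unless the option is on AND a new JobID is being generated.
     */
    public static List<String> selectOrphans(
            boolean optionEnabled,
            boolean generatingNewJobId,
            String pipelineName,
            List<ClusterJob> jobsOnCluster) {
        if (!optionEnabled || !generatingNewJobId) {
            return List.of();
        }
        return jobsOnCluster.stream()
                .filter(j -> !TERMINAL.contains(j.state))   // only non-terminal jobs
                .filter(j -> pipelineName.equals(j.name))   // assumed match criterion
                .map(j -> j.jobId)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ClusterJob> jobs = List.of(
                new ClusterJob("a1", "default.wordcount", "RUNNING"),
                new ClusterJob("b2", "default.wordcount", "CANCELED"),
                new ClusterJob("c3", "default.other", "RUNNING"));
        String name = defaultPipelineName("default", "wordcount");

        System.out.println(selectOrphans(true, true, name, jobs));  // [a1]
        System.out.println(selectOrphans(false, true, name, jobs)); // []
        System.out.println(selectOrphans(true, false, name, jobs)); // []
    }
}
```

Keeping the two guards (option enabled, new `JobID` being generated) at the top of the method makes the "no behavior change when the option is off" property easy to see and to test in isolation.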
## Verifying this change
This change added tests and can be verified as follows:
- Added a controller/reconciler test: with an orphaned job present on the
  session cluster and the option enabled, the orphan is cancelled before the
  new submission; with the option disabled, behavior is unchanged
- Existing session job submission and `JobID`-reuse tests continue to pass,
  confirming no behavior change when the option is off
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., are there any changes to the
  `CustomResourceDescriptors`: no
- Core observer or reconciler logic that is regularly executed: yes
  (`SessionJobReconciler.deploy`)
## Documentation
- Does this pull request introduce a new feature? yes (opt-in, default off)
- If yes, how is the feature documented? docs (generated config option
  reference)