lrsb opened a new pull request, #1135:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1135

   ## What is the purpose of the change
   
   This pull request guards against orphaned/duplicate session jobs. FLINK-38858
   already prevents duplicates when the recorded `JobID` survives a failed submission
   (it reuses the `JobID`, and Flink rejects a same-`JobID` resubmit). A residual
   window remains where the operator generates a **new** `JobID` while a job from a
   previous submission is still running on the session cluster - e.g. the submit
   succeeded on the cluster but the `JobID` was never durably persisted, or the
   `FlinkSessionJob` CR was deleted and recreated against the long-lived session
   cluster. The old job is then left orphaned, and two jobs run concurrently against
   the same sources/sinks, breaking exactly-once semantics.
   
   This change adds an opt-in safeguard: before a fresh submission, the operator
   cancels any matching job already running on the session cluster.
   
   ## Brief change log
   
     - Added config option `kubernetes.operator.session-job.cancel-orphaned-on-submit`
       (Boolean, default `false`)
     - Default `pipeline.name` to the namespace-qualified `FlinkSessionJob` resource
       name so a session job has a stable identity on the cluster
     - `SessionJobReconciler.deploy` lists non-terminal jobs on the session cluster
       and, when the option is enabled and a new `JobID` is being generated (not
       reusing an existing one), cancels matching jobs and waits for them to reach a
       terminal state before submitting
     - Cancelled `JobID`s are logged at WARN
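
The flow in the change log can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `OrphanGuardSketch`, `SessionCluster`, `FakeCluster`, `cancelOrphans` and the `maxPolls` bound are all hypothetical names, and matching jobs by their `pipeline.name` is an assumption inferred from the second bullet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these types are the operator's real classes. */
class OrphanGuardSketch {

    /** Simplified job states; the terminal ones are CANCELED, FINISHED and FAILED. */
    enum State {
        RUNNING, CANCELLING, CANCELED, FINISHED, FAILED;

        boolean isTerminal() {
            return this == CANCELED || this == FINISHED || this == FAILED;
        }
    }

    /** Hypothetical stand-in for listing, inspecting and cancelling jobs on a session cluster. */
    interface SessionCluster {
        Map<String, String> jobNamesById(); // jobId -> pipeline.name
        State stateOf(String jobId);
        void cancel(String jobId);
    }

    /**
     * Cancels non-terminal jobs whose name equals pipelineName, but only when the
     * option is enabled and a fresh JobID is being generated. Returns the cancelled ids.
     */
    static List<String> cancelOrphans(
            SessionCluster cluster,
            String pipelineName,
            boolean optionEnabled,
            boolean reusingJobId,
            int maxPolls) {
        List<String> cancelled = new ArrayList<>();
        if (!optionEnabled || reusingJobId) {
            return cancelled; // behavior unchanged when the option is off or the JobID is reused
        }
        for (Map.Entry<String, String> entry : cluster.jobNamesById().entrySet()) {
            String jobId = entry.getKey();
            boolean matches = pipelineName.equals(entry.getValue());
            if (!matches || cluster.stateOf(jobId).isTerminal()) {
                continue; // unrelated job, or already finished
            }
            cluster.cancel(jobId);
            // Wait for a terminal state; a real implementation would sleep between polls.
            int polls = 0;
            while (!cluster.stateOf(jobId).isTerminal()) {
                if (++polls > maxPolls) {
                    throw new IllegalStateException("Job " + jobId + " did not terminate");
                }
            }
            System.err.println("WARN cancelled orphaned job " + jobId);
            cancelled.add(jobId);
        }
        return cancelled;
    }

    /** In-memory fake whose cancel call takes effect immediately. */
    static final class FakeCluster implements SessionCluster {
        private final Map<String, String> names = new LinkedHashMap<>();
        private final Map<String, State> states = new LinkedHashMap<>();

        void add(String jobId, String name, State state) {
            names.put(jobId, name);
            states.put(jobId, state);
        }

        @Override
        public Map<String, String> jobNamesById() {
            return new LinkedHashMap<>(names); // copy, so callers can iterate safely
        }

        @Override
        public State stateOf(String jobId) {
            return states.get(jobId);
        }

        @Override
        public void cancel(String jobId) {
            states.put(jobId, State.CANCELED);
        }
    }
}
```

With the option disabled, or when an existing `JobID` is being reused, the sketch returns an empty list and touches nothing, which mirrors the "no behavior change when the option is off" guarantee described in this PR.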
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Added a controller/reconciler test: with an orphaned job present on the session
       cluster and the option enabled, the orphan is cancelled before the new
       submission; with the option disabled, behavior is unchanged
     - Existing session job submission and `JobID`-reuse tests continue to pass,
       confirming no behavior change when the option is off
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: yes
       (`SessionJobReconciler.deploy`)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes (opt-in, default off)
     - If yes, how is the feature documented? docs (generated config option reference)
   

