Hi Yanis,
Thanks for bringing this up. I hadn't gotten to this earlier, but I've now
taken a look and have some observations. (BTW, have you already started or
completed work on this?)
The problem statements and root causes are accurate, but a few items in the
expected behavior may need adjustment. I'm also adding some edge cases and
outlining potential work items. Please let me know your thoughts.
Clarifications:
- FlinkDeployment: "A subsequent change to running triggers a normal first
deployment" is incorrect. Once lastReconciledSpec is written for the suspended
initial state, isBeforeFirstDeployment() is permanently false, so the change
to running is handled by the upgrade path, not the first-deployment path.
Maybe it should be described as a STATELESS upgrade.
- FlinkBlueGreenDeployment: instead of "observedGeneration set on every
status update", it should only advance when a spec generation is fully
reconciled (i.e., when lastReconciledSpec is written), not on transition
progress patches. We're probably referring to the same thing; I'm just
clarifying.
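To make the second point concrete, here is a minimal sketch of the semantics I
have in mind. It is a simplified, hypothetical model: the class name, method
names, and state strings are made up for illustration and are not the
operator's actual code.

```java
// Hypothetical, simplified model of a Blue/Green status object.
// Names are illustrative only and do not mirror the operator's real classes.
final class BlueGreenStatusSketch {
    private String transitionState = "INITIALIZING_BLUE"; // illustrative value
    private String lastReconciledSpec; // null until a spec is fully reconciled
    private Long observedGeneration;   // null until a generation is acknowledged

    // Progress-only patch: records where the transition is, nothing else.
    void patchTransitionProgress(String newState) {
        this.transitionState = newState;
    }

    // Full reconciliation: the single place where observedGeneration advances.
    void recordReconciledSpec(String serializedSpec, long generation) {
        this.lastReconciledSpec = serializedSpec;
        this.observedGeneration = generation;
    }

    String getTransitionState() { return transitionState; }
    String getLastReconciledSpec() { return lastReconciledSpec; }
    Long getObservedGeneration() { return observedGeneration; }
}
```

With this shape, a tool polling observedGeneration only sees it move once the
corresponding spec has actually been recorded as reconciled.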
Missing details or edge cases:
- FlinkDeployment:
  - If job.initialSavepointPath is set alongside job.state: suspended, it
    will be silently lost when the user later changes the state to running.
    The first-deployment path normally handles this by copying it to
    upgradeSavepointPath; the upgrade path does not.
  - updateStatusForSpecReconciliation() automatically calls
    markReconciledSpecAsStable() when the spec's job state is SUSPENDED.
    Acknowledging a suspended initial spec should result in
    reconciliationStatus.state = STABLE, not UPGRADING or DEPLOYED, right?
  - Error semantics change: currently, if an error occurs while
    isBeforeFirstDeployment() is true, getLifecycleState() returns FAILED.
    After this fix, with lastReconciledSpec set to SUSPENDED, an error would
    leave lifecycleState as SUSPENDED, because the SUSPENDED check runs
    before the job-failed/JM-error checks. Let's decide whether errors during
    the suspended-initial phase should surface as FAILED or SUSPENDED.
- FlinkBlueGreenDeployment:
  - FlinkBlueGreenDeploymentStatus doesn't extend CommonStatus by design.
    This is not a problem: we just add the property and the code to write
    it. The most natural place is inside setLastReconciledSpec() in
    BlueGreenUtils.
  - CRITICAL: I realized that after recording lastReconciledSpec for the
    suspended initial state, InitializingBlueStateHandler's deploy condition
    becomes false for a SUSPENDED → RUNNING spec change (it only checks for
    null or FAILING), causing noUpdate() to be returned and the deployment to
    never trigger. The condition must be extended to handle this transition.
  - Are you relying on lifecycleState as well? There's no equivalent of
    this field for Blue/Green deployments. We can address this gap
    separately if necessary.
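For the CRITICAL item, a rough sketch of the kind of extension I mean. This is
a hypothetical model, not the real InitializingBlueStateHandler: the class,
enum, and method names are invented, and I'm assuming the existing check can
be reduced to "nothing reconciled yet (null) or job FAILING".

```java
// Hypothetical, simplified model of the initial-deploy decision.
// It only illustrates the missing case; it is not the operator's actual code.
final class InitialDeployConditionSketch {
    enum JobState { RUNNING, SUSPENDED }

    // Condition as described above: deploy only when nothing has been
    // reconciled yet (null) or the job is FAILING.
    static boolean shouldDeployToday(JobState lastReconciledState, boolean jobFailing) {
        return lastReconciledState == null || jobFailing;
    }

    // Extended condition: also deploy when the last reconciled spec was the
    // suspended initial state and the desired state is now RUNNING.
    static boolean shouldDeployExtended(
            JobState lastReconciledState, JobState desiredState, boolean jobFailing) {
        boolean resumeFromSuspendedInitial =
                lastReconciledState == JobState.SUSPENDED
                        && desiredState == JobState.RUNNING;
        return lastReconciledState == null || jobFailing || resumeFromSuspendedInitial;
    }
}
```

In this model, shouldDeployToday(SUSPENDED, false) is false, which is the
stuck case, while shouldDeployExtended(SUSPENDED, RUNNING, false) is true.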
Sergio
> On Mar 11, 2026, at 7:21 AM, Yanis Djeridi via dev <[email protected]>
> wrote:
>
> Hi everyone,
>
> I would like to start a discussion about FLINK-39243: Include
> observedGeneration for Suspended Flink Deployments [1].
>
> Currently, there are two gaps in how the Flink Kubernetes Operator handles
> observedGeneration, which violates Kubernetes API conventions and breaks
> integration with standard deployment tools (e.g., Kapp) that rely on
> observedGeneration to determine whether a controller has processed a spec
> change:
>
> FlinkDeployment: When created with spec.job.state: suspended, the operator
> returns early without updating any status fields, observedGeneration,
> lastReconciledSpec, and lifecycleState all remain unset.
>
> FlinkBlueGreenDeployment: The status schema does not include an
> observedGeneration field at all, so deployment tools can never determine
> whether the controller has processed a given generation.
>
> The proposed changes are:
>
> For FlinkDeployment: acknowledge the suspended spec by setting
> status.observedGeneration, recording lastReconciledSpec with state SUSPENDED,
> and setting lifecycleState to SUSPENDED, without deploying any Flink
> resources.
>
> For FlinkBlueGreenDeployment: add an observedGeneration field to the status
> class and record lastReconciledSpec when blocking on a suspended initial
> state.
>
> Looking forward to your feedback on the approach!
>
> [1] https://issues.apache.org/jira/browse/FLINK-39243
>
> Best Regards,
> Yanis Djeridi
>