[PR] [FLINK-35414] Rework last-state upgrade mode to support job cancellation as suspend mechanism [flink-kubernetes-operator]

via GitHub Mon, 26 Aug 2024 01:54:16 -0700


gyfora opened a new pull request, #871:
URL: https://github.com/apache/flink-kubernetes-operator/pull/871


   ## What is the purpose of the change
   
   Rework the last-state upgrade mode to not be solely reliant on HA metadata 
but to be flexible and use the job cancel mechanism in other cases. This change 
also allows session jobs, where HA metadata is not accessible the same way as 
for application clusters, to use last-state upgrade mode.
   
   ### Last state upgrades using cancel
   
   Currently last-state upgrade mode relies purely on HA metadata that is 
available for application deployments to simulate a failover during upgrade and 
make the JM pick up the correct last state automatically. This has a couple of 
limitations, first and foremost that it is not applicable to session jobs.
   
   With this PR we introduce a new mechanism for last-state upgrades of 
non-terminal jobs (the terminal case is already covered by existing mechanisms):
   
   1. Cancel the job through the REST API (async operation)
   2. Wait until the job cancellation completes and the job becomes CANCELLED 
(terminal state)
   3. Observe last state information through REST API and use that for upgrade 
(upgrade flow already there for terminal jobs) 
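
The three steps above can be sketched as a single reconcile pass that is 
invoked repeatedly. This is a minimal illustration, not the operator's code: 
`JobRestClient`, `FakeClient` and `reconcileOnce` are hypothetical names 
introduced here, and the real REST calls are replaced by an in-memory fake.

```java
import java.util.Optional;

// Illustrative sketch only: all types here are hypothetical, not operator classes.
final class LastStateCancelSketch {

    // Spelled as in the description above; Flink's own JobStatus enum uses CANCELED.
    enum JobState { RUNNING, CANCELLING, CANCELLED }

    /** Hypothetical stand-in for the calls made against the Flink REST API. */
    interface JobRestClient {
        JobState getState();
        void cancelAsync();
        Optional<String> lastCheckpointPath();
    }

    /**
     * One reconcile pass. Returns the state to restore from once the job is
     * CANCELLED, or empty if the caller should re-schedule and check again.
     */
    static Optional<String> reconcileOnce(JobRestClient client) {
        switch (client.getState()) {
            case RUNNING:
                client.cancelAsync();               // step 1: request cancel, return at once
                return Optional.empty();
            case CANCELLING:
                return Optional.empty();            // step 2: cancellation still in progress
            case CANCELLED:
                return client.lastCheckpointPath(); // step 3: read the last state
            default:
                throw new IllegalStateException("unexpected job state");
        }
    }

    /** In-memory fake whose cancellation completes one poll after it was requested. */
    static final class FakeClient implements JobRestClient {
        private final String checkpointPath;
        private JobState state = JobState.RUNNING;

        FakeClient(String checkpointPath) {
            this.checkpointPath = checkpointPath;
        }

        @Override
        public JobState getState() {
            JobState current = state;
            if (state == JobState.CANCELLING) {
                state = JobState.CANCELLED; // visible on the next poll
            }
            return current;
        }

        @Override
        public void cancelAsync() {
            state = JobState.CANCELLING;
        }

        @Override
        public Optional<String> lastCheckpointPath() {
            return state == JobState.CANCELLED ? Optional.of(checkpointPath) : Optional.empty();
        }
    }
}
```

With the fake, the first two calls to `reconcileOnce` return empty (cancel 
requested, then still cancelling) and the third returns the checkpoint path.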
   
   This new mechanism is similar to what a human operator would do for these 
jobs. It does not rely on HA metadata, so it works for both application and 
session jobs, including cases where HA metadata is otherwise unusable, such as 
during version upgrades or when HA is disabled.
   
   ### Changes to the reconciliation flow for correct cancellation during 
upgrades
   
   Currently the async nature of cancellation is not handled correctly in the 
reconciler, even though session jobs already use it to cancel jobs. In extreme 
cases this can lead to 2 parallel jobs running on the same cluster.
   
   To handle this, the reconciler now explicitly checks for the CANCELLING 
state and does not perform other upgrade actions until the cancellation 
completes. Also, after initiating an async cancel action through the REST API 
we immediately exit and re-schedule the observation to wait until the 
cancellation completes and we can observe the last state of the cluster.
   
   The observer now also recognises the CANCELLING state as a special 
user-initiated action, and when the job becomes CANCELLED (or is not found, in 
the case of session jobs) it explicitly marks it SUSPENDED. This means that the 
reconciler will always resume it subsequently, eliminating the risk of ending 
up with a cancelled job if the spec change was rolled back in the meantime.
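
The observer rule described above can be sketched as a pure function from the 
previously recorded state and the freshly reported state to the state that 
gets recorded. The enum, the `observe` method and its parameters are 
assumptions made for illustration; the operator's actual status model is 
richer than this.

```java
import java.util.Optional;

// Illustrative sketch only: hypothetical types, not the operator's observer.
final class CancellingObserverSketch {

    enum State { RUNNING, CANCELLING, CANCELLED, SUSPENDED }

    /**
     * @param previous   state recorded by the last observation
     * @param reported   state reported over REST, empty if the job was not found
     * @param sessionJob whether the resource is a session job
     * @return the state to record for this observation
     */
    static State observe(State previous, Optional<State> reported, boolean sessionJob) {
        boolean cancelInProgress = previous == State.CANCELLING;
        if (cancelInProgress) {
            boolean cancelled = reported.isPresent() && reported.get() == State.CANCELLED;
            boolean goneFromSession = !reported.isPresent() && sessionJob;
            if (cancelled || goneFromSession) {
                // The cancel was initiated for an upgrade, so record SUSPENDED
                // to make sure the reconciler resumes the job afterwards.
                return State.SUSPENDED;
            }
        }
        // Otherwise record what was reported; a missing job keeps its previous
        // state here (handling of missing jobs is out of scope for this sketch).
        return reported.orElse(previous);
    }
}
```

Note that a job reported as CANCELLED without a preceding CANCELLING state is 
not converted to SUSPENDED in this sketch; only a cancel that was observed in 
progress is treated as a suspend.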
   
   ### Refactored and improved FlinkService cancel methods
   
   To eliminate duplicate logic and reduce overall complexity, the cancel 
methods for application and session jobs have been refactored to re-use the 
common parts. A significant portion of the logic has also been removed by 
separating the suspend and restore (upgrade) mechanisms.
   
   The `JobUpgrade` utility class now encapsulates the necessary suspend and 
restore mechanism for the stateful upgrade depending on the current observed 
state. This allows us to better handle async cancellation 
(SuspendMode.CANCEL) and, if the job is already cancelled (or in a terminal 
state), to do nothing (SuspendMode.NOOP) and simply perform the restore.
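
As a rough illustration of that choice, the sketch below picks a suspend mode 
from the observed job state. Only the two modes named above are modelled; the 
enum contents, `suspendModeFor` and `isTerminal` are assumptions for this 
example and do not reflect the actual `JobUpgrade` API.

```java
// Illustrative sketch only: hypothetical types, not the real JobUpgrade class.
final class SuspendModeSketch {

    // Only the two modes mentioned in the description; the real enum may have more.
    enum SuspendMode { CANCEL, NOOP }

    // Spelled as in the description; Flink's own JobStatus enum uses CANCELED.
    enum JobState { RUNNING, CANCELLED, FAILED, FINISHED }

    /** Terminal states need no suspend step before the restore. */
    static boolean isTerminal(JobState state) {
        return state == JobState.CANCELLED
                || state == JobState.FAILED
                || state == JobState.FINISHED;
    }

    /** Chooses how to suspend the job before a last-state restore. */
    static SuspendMode suspendModeFor(JobState observed) {
        // An already stopped job is restored directly; a live job must be
        // cancelled first (asynchronously) and observed until it is terminal.
        return isTerminal(observed) ? SuspendMode.NOOP : SuspendMode.CANCEL;
    }
}
```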
   
   ### Misc session job changes / fixes
   
   In addition to making last-state upgrade mode generally available for 
session jobs, this PR includes several critical fixes to the core upgrade 
cleanup logic as a result of this work, such as:
   
    - Improved cleanup method that correctly waits until the job is fully 
cancelled instead of deleting the CR too early (which risked leaving the job 
running)
    - Call observe during cancel for session jobs for correct behaviour
    - Use correct job config generation for session jobs, similar to 
applications, such as retaining checkpoints during cancellation by default, 
which is needed for the above cancel mechanism
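
The first two fixes in the list above amount to observing the job on every 
cleanup attempt and only allowing the CR to be deleted once the job is 
CANCELLED. A minimal sketch under that reading follows; `CleanupDecision`, 
`cleanup` and the retry delay parameter are hypothetical names, not the 
operator's cleanup API.

```java
import java.time.Duration;

// Illustrative sketch only: hypothetical types, not the operator's cleanup code.
final class SessionJobCleanupSketch {

    enum JobState { RUNNING, CANCELLING, CANCELLED }

    /** Either the CR may be deleted now, or cleanup must be retried later. */
    static final class CleanupDecision {
        final boolean deleteNow;
        final Duration retryAfter; // null when deleteNow is true

        private CleanupDecision(boolean deleteNow, Duration retryAfter) {
            this.deleteNow = deleteNow;
            this.retryAfter = retryAfter;
        }

        static CleanupDecision allowDelete() {
            return new CleanupDecision(true, null);
        }

        static CleanupDecision retryIn(Duration delay) {
            return new CleanupDecision(false, delay);
        }
    }

    /**
     * @param observed    job state from a fresh observation made during cleanup
     * @param cancelAsync action that requests cancellation without waiting
     * @param retryDelay  how long to wait before the next cleanup attempt
     */
    static CleanupDecision cleanup(JobState observed, Runnable cancelAsync, Duration retryDelay) {
        if (observed == JobState.CANCELLED) {
            return CleanupDecision.allowDelete(); // safe: nothing is left running
        }
        if (observed == JobState.RUNNING) {
            cancelAsync.run(); // request cancel once; CANCELLING just waits
        }
        return CleanupDecision.retryIn(retryDelay); // keep the CR until cancel completes
    }
}
```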
   
   ### Other changes / improvements as an outcome
   
   - Remove last-state upgrade limitations for apps and use cancel in these 
cases (Flink version upgrade for non-running jobs, jobs without HA enabled)
   
   ## Verifying this change
   
    - Existing unit tests and E2E tests guard the current behaviour
    - New unit tests have been added to cover the session job last-state 
upgrades and the improved observe, reconcile, cleanup flow
    - Extensive manual testing on local Kubernetes
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the 
`CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: yes
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? [TODO]
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

