Hello Sergio,

Thank you for starting this discussion; I have a few questions.
> having 2 identical pipelines running side-by-side

How do you ensure correctness between the 2 identical pipelines? For
example, processing-time semantics or late data can result in different
outputs. Some sources also have consumption quotas, for example Kinesis
Data Streams; running two consumers may eat into that quota and cause
problems. And how do we handle sources like queues that cannot be
consumed concurrently?

> we explore the idea of empowering the pipeline to decide, at the record
> level, what data goes through and what doesn’t (by means of a “Gate”
> component).

What happens if the Green job gets ahead of the Blue job? How will you
pick a per-channel stopping point consistently (checkpoint barriers
already solve this, right?)? If you use a point in (event) time, you need
to handle idle channels.

> This new Controller will work in conjunction with a custom Flink Process
> Function

Would this rely on the user including the operator in the job graph, or
would you somehow dynamically inject it? Where would you put this operator
in complex job graphs or forests? To make these questions concrete, I've
sketched below how I'm currently reading the gate idea.
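A minimal sketch, assuming a Fabric8 watch on the ConfigMap and an
event-time cutoff. Everything here — the key names ("activeColor",
"activeFromTimestamp") and the "color" constructor parameter — is my own
invention, not from your proposal:

import java.util.Map;

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

/** Pass-through gate driven by a ConfigMap watch (illustrative only). */
public class GateProcessFunction<T> extends ProcessFunction<T, T> {

    private final String namespace;
    private final String configMapName;
    private final String jobColor; // "blue" or "green"

    // Updated asynchronously by the watch thread, read on every record.
    private transient volatile boolean active;
    private transient volatile long activeFromTimestamp;
    private transient KubernetesClient client;

    public GateProcessFunction(String namespace, String configMapName,
                               String jobColor) {
        this.namespace = namespace;
        this.configMapName = configMapName;
        this.jobColor = jobColor;
    }

    @Override
    public void open(Configuration parameters) {
        client = new KubernetesClientBuilder().build();
        // Read the current state once; the watch only reports changes.
        ConfigMap current = client.configMaps().inNamespace(namespace)
                .withName(configMapName).get();
        if (current != null) {
            applyGateState(current);
        }
        client.configMaps().inNamespace(namespace).withName(configMapName)
                .watch(new Watcher<ConfigMap>() {
                    @Override
                    public void eventReceived(Action action, ConfigMap cm) {
                        applyGateState(cm);
                    }

                    @Override
                    public void onClose(WatcherException cause) {
                        // A real implementation must re-establish the watch.
                    }
                });
    }

    private void applyGateState(ConfigMap cm) {
        Map<String, String> data = cm.getData();
        if (data == null) {
            return;
        }
        // Guessed keys: which color is active, and the event-time point
        // from which that color owns the output.
        active = jobColor.equals(data.get("activeColor"));
        activeFromTimestamp =
                Long.parseLong(data.getOrDefault("activeFromTimestamp", "0"));
    }

    @Override
    public void processElement(T value, Context ctx, Collector<T> out) {
        Long ts = ctx.timestamp();
        // Emit only while this job is the active color and the record sits
        // at or after the handover point; everything else is dropped.
        if (active && ts != null && ts >= activeFromTimestamp) {
            out.collect(value);
        }
    }

    @Override
    public void close() {
        if (client != null) {
            client.close();
        }
    }
}

The flip in processElement() is where my questions concentrate: the two
jobs observe the ConfigMap change at different wall-clock moments, so
without a checkpoint-barrier-like cutover one side can emit past the
agreed point (or leave a gap), and with exactly-once sinks the standby
side may still hold open transactions.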
> A ConfigMap will be used to both contain the deployment parameters as
> well as act as a communication conduit between the Controller and the
> two Jobs.

Will a ConfigMap write/read cycle be fast enough? If you write "stop at
record x" into the CM, the other job might read past record x before it
sees the command.

> This function will listen for changes to the above mentioned ConfigMap
> and each job will know if it’s meant to be Active or Standby, at record
> level.

How will this work with exactly-once sinks, where you might have open
transactions?

Thanks,
Danny

On Mon, Nov 25, 2024 at 6:25 PM Sergio Chong Loo
<schong...@apple.com.invalid> wrote:
>
> Hi,
>
> As part of our work with Flink, we identified the need for a solution to
> have minimal “downtime” when re-deploying pipelines. This is especially
> the case when the startup times are high and lead to downtime that’s
> unacceptable for certain workloads.
>
> One solution we thought about is the notion of Blue/Green deployments,
> which I conceptually describe in this email along with a potential list
> of modifications, the majority within the Flink K8s Operator.
>
> This is open for discussion; we’d love to get feedback from the
> community.
>
> Thanks!
> Sergio Chong
>
> *Proposal to add Blue/Green Deployments support to the Flink K8s
> Operator*
>
> Problem Statement
>
> As stateful applications, Flink pipelines require the data stream flow
> to be halted during deployment operations. In some cases, stopping this
> data flow could have a negative impact on downstream consumers. The
> following figure helps illustrate this:
>
> A Blue/Green deployment architecture, or Active/StandBy, can help
> overcome this issue by having 2 identical pipelines running
> side-by-side, with only one (the Active) producing or outputting records
> at all times.
>
> [image: PastedGraphic-1.png]
>
> Moreover, since the pipelines are the ones “handling” each and every
> record, we explore the idea of empowering the pipeline to decide, at the
> record level, what data goes through and what doesn’t (by means of a
> “Gate” component).
>
> [image: PastedGraphic-2.png]
>
> Proposed Modifications
>
> - Give the Flink K8s Operator the capability of managing the lifecycle
> of these deployments.
> - A new CRD (e.g. FlinkBlueGreenDeployment) will be introduced.
> - A new Controller (e.g. FlinkBlueGreenDeploymentController) to manage
> this CRD and hide from the user the details of the actual Blue/Green
> (Active/StandBy) jobs.
> - Delegate the lifecycle of the actual Jobs to the existing
> FlinkDeployment controller.
> - At all times the controller “knows” which deployment type is/will
> become Active, from which *point in time*. This implies an Event
> Watermark driven implementation, but with an extensible architecture to
> easily support other use cases.
> - A ConfigMap will be used to both contain the deployment parameters as
> well as act as a communication conduit between the Controller and the
> two Jobs.
> - This new Controller will work in conjunction with a custom Flink
> Process Function (potentially called GateProcessFunction) on the
> pipeline/client side. This function will listen for changes to the above
> mentioned ConfigMap, and each job will know if it’s meant to be Active
> or Standby, at record level.
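P.S. To check my reading of the CRD piece: I'd expect the new resource to
be declared the way the operator already declares FlinkDeployment, i.e.
as a Fabric8 CustomResource class. FlinkBlueGreenDeployment is the name
from your proposal, but every field below is purely my guess:

import io.fabric8.kubernetes.api.model.Namespaced;
import io.fabric8.kubernetes.client.CustomResource;
import io.fabric8.kubernetes.model.annotation.Group;
import io.fabric8.kubernetes.model.annotation.Version;
import org.apache.flink.kubernetes.operator.api.spec.FlinkDeploymentSpec;

/** Illustrative CRD shape only; all fields below are guesses. */
@Group("flink.apache.org")
@Version("v1beta1")
public class FlinkBlueGreenDeployment
        extends CustomResource<FlinkBlueGreenDeploymentSpec,
                               FlinkBlueGreenDeploymentStatus>
        implements Namespaced {}

/** Template plus (guessed) transition knobs for the two child jobs. */
class FlinkBlueGreenDeploymentSpec {
    // Spec the controller would stamp onto both child FlinkDeployments.
    public FlinkDeploymentSpec template;
    // How long to let the standby job warm up before flipping the gate.
    public long transitionGracePeriodMs;
}

/** What the controller would report back (and mirror into the CM). */
class FlinkBlueGreenDeploymentStatus {
    public String activeColor;       // "blue" or "green"
    public long activeFromTimestamp; // event-time point of the handover
}

Is that roughly the shape you have in mind, with the controller stamping
out the two child FlinkDeployments from the template?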