Hello Sergio,

Thank you for starting this discussion; I have a few questions.

> having 2 identical pipelines running side-by-side
How do you ensure consistency between the two identical pipelines? For
example, processing-time semantics or late data can lead the two jobs to
produce different outputs.
Some sources also have consumption quotas; Kinesis Data Streams, for
example, caps per-shard read throughput across all consumers, so running
two pipelines against the same stream may eat into that quota and cause
problems.
How do we handle sources like queues, which cannot be consumed by two
jobs concurrently?

> we explore the idea of empowering the pipeline to decide, at the record
level, what data goes through and what doesn’t (by means of a “Gate”
component).
What happens if the Green job gets ahead of the Blue job? How will you
pick a consistent stopping point per channel (isn't this what checkpoint
barriers already solve?)? If you use a point in (event) time, you need to
handle idle channels whose watermark never advances.
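
To make the idle-channel concern concrete, here is a minimal sketch of
how I imagine an event-time-driven gate behaving. This is only my guess
at the mechanics (the name GateProcessFunction is from your proposal,
the fields and cutover logic are my assumptions):

    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    // Hypothetical gate: forwards records only while this job is Active.
    // The cutover is judged against the current event-time watermark.
    public class GateProcessFunction<T> extends ProcessFunction<T, T> {

        private final long cutoverTimestamp;       // event-time cutover point
        private final boolean activeBeforeCutover; // Blue = true, Green = false

        public GateProcessFunction(long cutoverTimestamp,
                                   boolean activeBeforeCutover) {
            this.cutoverTimestamp = cutoverTimestamp;
            this.activeBeforeCutover = activeBeforeCutover;
        }

        @Override
        public void processElement(T value, Context ctx, Collector<T> out) {
            boolean beforeCutover =
                ctx.timerService().currentWatermark() < cutoverTimestamp;
            if (beforeCutover == activeBeforeCutover) {
                out.collect(value); // gate open: this job owns the output
            }
            // else: gate closed, the other job emits this record
        }
    }

If one source partition goes idle, currentWatermark() may never cross
the cutover point, so Blue keeps emitting forever and Green never takes
over. That is the case I would like to see addressed.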

> This new Controller will work in conjunction with a custom Flink Process
Function
Would this rely on the user including the operator in the job graph, or
would you somehow inject it dynamically?
Where would you place this operator in complex job graphs, or in forests
with multiple independent sources and sinks?
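
For context, is the expectation something like the following, where the
user wires the gate in explicitly just before the sink? (Purely
illustrative; the job below is made up and reuses the hypothetical
GateProcessFunction from my sketch above.)

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class GatedJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // Toy source; a real job would use a Kafka/Kinesis source with
            // a proper WatermarkStrategy so the watermark actually advances.
            DataStream<String> events =
                env.socketTextStream("localhost", 9999);

            events
                .process(new GateProcessFunction<>(1_700_000_000_000L, true))
                .print(); // stand-in for the real sink

            env.execute("blue-job");
        }
    }

Explicit wiring puts the burden on the user; dynamic injection would
need a well-defined insertion point in front of every sink in the graph.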

> A ConfigMap will be used to both contain the deployment parameters as
well as act as a communication conduit between the Controller and the two
Jobs.
Will a ConfigMap write/read cycle be fast enough? If you write "stop at
record x" into the CM, the other job might read past record x before it
sees the command.
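
For reference, my mental model of the job side is a ConfigMap watch
along these lines (sketched with the fabric8 client; the namespace, CM
name and key are made up):

    import io.fabric8.kubernetes.api.model.ConfigMap;
    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClientBuilder;
    import io.fabric8.kubernetes.client.Watcher;
    import io.fabric8.kubernetes.client.WatcherException;

    public class GateConfigWatcher {
        public static void main(String[] args) {
            KubernetesClient client = new KubernetesClientBuilder().build();
            client.configMaps()
                  .inNamespace("flink")
                  .withName("bluegreen-gate")
                  .watch(new Watcher<ConfigMap>() {
                      @Override
                      public void eventReceived(Action action, ConfigMap cm) {
                          // By the time this fires (etcd write + API server
                          // + watch latency), the job may already be past
                          // the requested stop point.
                          String stopAt = cm.getData().get("stopAtRecord");
                          System.out.println("gate command: " + stopAt);
                      }

                      @Override
                      public void onClose(WatcherException cause) { }
                  });
        }
    }

That propagation delay is why I suspect the cutover has to be expressed
as a point in the (event-time) future rather than an imperative "stop
now".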

> This function will listen for changes to the above mentioned ConfigMap
and each job will know if it’s meant to be Active or Standby, at record
level.
How will this work with exactly-once (two-phase commit) sinks, when
either job might have open transactions at the moment of cutover?
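
For concreteness, with Flink's TwoPhaseCommitSinkFunction the danger
zone I see is a transaction that has been pre-committed at a checkpoint
but not yet committed when the gate flips (lifecycle simplified from the
standard contract; the failure story is my reading, not your proposal's):

    // org.apache.flink.streaming.api.functions.sink
    //     .TwoPhaseCommitSinkFunction lifecycle, simplified:
    //
    //   beginTransaction()  -> records are written into an open txn
    //   preCommit(txn)      -> on checkpoint: flush, txn still uncommitted
    //   commit(txn)         -> on checkpoint-complete notification
    //   abort(txn)          -> on failure / recovery
    //
    // If the Standby flips to Active between preCommit() and commit(),
    // both jobs may hold transactions covering the same records: commit
    // both and you get duplicates; abort one and you may lose data.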

Thanks,
Danny

On Mon, Nov 25, 2024 at 6:25 PM Sergio Chong Loo
<schong...@apple.com.invalid> wrote:

>
> Hi,
>
> As part of our work with Flink, we identified the need for a solution to
> have minimal “downtime” when re-deploying pipelines. This is especially the
> case when the startup times are high and lead to downtime that’s
> unacceptable for certain workloads.
>
> One solution we thought about is the notion of Blue/Green deployments
> which I conceptually describe in this email along with a potential list of
> modifications, the majority within the Flink K8s Operator.
>
> This is open for discussion, we’d love to get feedback from the community.
>
> Thanks!
> Sergio Chong
>
> *Proposal to add Blue/Green Deployments support to the Flink K8s Operator*
> Problem Statement
> As stateful applications, Flink pipelines require the
> data stream flow to be halted during deployment operations. In some cases
> stopping this data flow could have a negative impact on downstream
> consumers. The following figure helps illustrate this:
>
> A Blue/Green deployment architecture, or Active/StandBy, can help overcome
> this issue by having 2 identical pipelines running side-by-side with only
> one (the Active) producing or outputting records at all times.
>
> [image: PastedGraphic-1.png]
>
> Moreover, since the pipelines are the ones “handling” each and every
> record, we explore the idea of empowering the pipeline to decide, at the
> record level, what data goes through and what doesn’t (by means of a “Gate”
> component).
> [image: PastedGraphic-2.png]
> Proposed Modifications
>
>    - Give the Flink K8s Operator the capability of managing the
>    lifecycle of these deployments.
>    - A new CRD (e.g. FlinkBlueGreenDeployment) will be introduced.
>    - A new Controller (e.g. FlinkBlueGreenDeploymentController) to manage
>    this CRD and hide from the user the details of the actual Blue/Green
>    (Active/StandBy) jobs.
>    - Delegate the lifecycle of the actual Jobs to the existing
>    FlinkDeployment controller.
>    - At all times the controller “knows” which deployment type is/will
>    become Active, from which *point in time.* This implies an Event
>    Watermark driven implementation but with an extensible architecture to
>    easily support other use cases.
>    - A ConfigMap will be used to both contain the deployment parameters
>    as well as act as a communication conduit between the Controller and the
>    two Jobs.
>    - This new Controller will work in conjunction with a custom Flink
>    Process Function (potentially called GateProcessFunction) on the
>    pipeline/client side. This function will listen for changes to the above
>    mentioned ConfigMap and each job will know if it’s meant to be Active or
>    Standby, at record level.