Draft PR)

Sergio Chong Loo Wed, 01 Apr 2026 09:25:23 -0700

Absolutely no rush, take your time.

I’m also still working on some details around error handling. I’ll reach out 
offline so you can always work on the latest.


Thank you both as well for the feedback!

- Sergio


> On Apr 1, 2026, at 8:16 AM, Daniel Rossos <[email protected]> wrote:
> 
> Hi Sergio,
> 
> Sorry about the delay on my end (just got back after a few weeks off). I have 
> caught up with + agree with all the feedback Ryan provided in this thread.
> 
> The Gate auto-injection idea is really cool. I'm going to try and make some 
> time this week to test out your PR on my side and provide feedback and 
> thoughts.
> 
> Thanks again for driving this,
> Daniel
> 
> 
> On Thu, Mar 26, 2026 at 11:13 AM Ryan van Huuksloot via dev 
> <[email protected] <mailto:[email protected]>> wrote:
>> Awesome! Thanks for the update, Sergio. I'm excited to see the plan - it is
>> a cool idea so I'm glad it is working.
>> 
>> Ryan van Huuksloot
>> Staff Engineer, Infrastructure | Streaming Platform
>> 
>> 
>> On Thu, Mar 26, 2026 at 1:46 AM Sergio Chong Loo <[email protected] 
>> <mailto:[email protected]>>
>> wrote:
>> 
>> > @Ryan / @Daniel,
>> >
>> > Good news! The “GateInjectorPipelineExecutor” idea is successful!!
>> >
>> > While the original approach of simply activating it with
>> > “execution.target” did not quite work, I was able to implement it via the
>> > *Instrumentation API with a Java Agent* that injects it. The user doesn’t
>> > have to touch their pipelines, and I added 2 options, at least for now, to
>> > place/inject the Gate after the source or before the sink (complex DAG
>> > cases with multiple sources or sinks are not supported for now).
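For readers unfamiliar with the mechanism mentioned here, a Java agent built on the Instrumentation API has roughly the shape below. This is only a minimal sketch and not the code from the Draft PR: the names `GateAgentSketch` and `LoggingTransformer` and the target class `com/example/HypotheticalGraphBuilder` are made up for illustration, and the transformer deliberately leaves the bytecode untouched.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

/** Hypothetical agent skeleton; a real agent class would typically be declared public. */
final class GateAgentSketch {

    // Assumption: placeholder for whichever class a real agent would instrument
    // (JVM internal form, with '/' instead of '.').
    static final String TARGET_CLASS = "com/example/HypotheticalGraphBuilder";

    /** Recognises the target class but does not rewrite any bytecode. */
    static final class LoggingTransformer implements ClassFileTransformer {
        int matches = 0;

        @Override
        public byte[] transform(ClassLoader loader,
                                String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (TARGET_CLASS.equals(className)) {
                matches++;
                // A real agent would return rewritten bytecode at this point.
            }
            return null; // null tells the JVM to keep the original bytecode
        }
    }

    /** Invoked by the JVM before the application's main() when the agent is attached. */
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoggingTransformer());
    }
}
```

An agent like this is attached with `-javaagent:<path-to-agent.jar>`, where the jar manifest names the agent class in its `Premain-Class` attribute; that is what lets user pipelines stay unmodified.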
>> >
>> > I’m documenting everything and prepping the Draft PR for your review,
>> > probably a couple more days.
>> >
>> > Thanks, stay tuned.
>> >
>> > - Sergio
>> >
>> >
>> > On Mar 16, 2026, at 3:30 PM, Sergio Chong Loo <[email protected] 
>> > <mailto:[email protected]>> wrote:
>> >
>> > Thanks for the ideas and the offer to help out, Ryan! It’s invaluable to
>> > learn about other users’/teams’ scenarios.
>> >
>> > Indeed, I have to pursue and evaluate the GateInjectorExecutor nonetheless
>> > for our internal development. Ideally the user could simply “invoke” the
>> > functionality, and even have an option to specify “where” the gating
>> > mechanism is placed (e.g. right after the source or before the sink); for
>> > the most flexibility they can still incorporate and place the Gate
>> > manually, just like it is now.
>> >
>> > I’ll share the progress asap and we can all take it from there (this
>> > should not exceed a couple weeks). I’ll definitely need more of your
>> > feedback to verify this with Flink SQL.
>> >
>> > Thanks again,
>> > Sergio
>> >
>> >
>> > On Mar 16, 2026, at 7:01 AM, Ryan van Huuksloot <
>> > [email protected] <mailto:[email protected]>> 
>> > wrote:
>> >
>> > Hi Sergio,
>> >
>> > re: 1.1
>> > My thought is that a BlueGreen Mixin isn't Kubernetes specific and could
>> > be reused by other deployment control planes. However, I do agree that
>> > attaching it to the sink has other implications so I am happy to pivot if
>> > we can find an alternative solution.
>> >
>> > re: 2
>> > I'm happy to leave it out of the Phase 2 implementation, but I think it
>> > should be possible. For example we use Phase 1 with cross cluster
>> > migrations today. Phase 2 within a single cluster isn't particularly useful
>> > for us.
>> >
>> > re: GateInjectorExecutor
>> > This sounds like a neat idea. I need to read more about how it would work
>> > but from a high level, injecting an operator before your sinks sounds like
>> > a good idea. Better isolation, possible with SQL, no mixins, etc.
>> >
>> > I will mention that part of the reason I want it before the sinks is
>> > because nine out of ten people building pipelines struggle to understand
>> > where their state is and how Phase 2 would affect the correctness of their
>> > state depending on where they put the gate. I understand that if you have a
>> > remote lookup and want to save bandwidth, you could optimize your pipeline
>> > by moving the gate before the remote call; however, that seems like an
>> > optimization that can be made later.
>> >
>> > Thanks for driving this! Let me know how we can help.
>> >
>> > Ryan van Huuksloot
>> > Staff Engineer, Infrastructure | Streaming Platform
>> >
>> >
>> > On Mon, Mar 16, 2026 at 2:21 AM Sergio Chong Loo <[email protected] 
>> > <mailto:[email protected]>>
>> > wrote:
>> >
>> >> Hi Ryan
>> >>
>> >> Thanks a lot for these details. For sure some of these observations
>> >> popped up during our initial discussions, and that’s why our initial goal
>> >> was to introduce this as simply as possible and gradually enhance it to
>> >> cover gaps.
>> >>
>> >> Allow me to address your concerns:
>> >>
>> >>    1. I’m happy you stressed the point of “disruption to existing
>> >>    pipelines”. However, there are a few concerns about building this
>> >>    functionality into the sinks (or sources) right off the bat (read
>> >>    further below for my alternative):
>> >>       1. Kubernetes centric: as of now the Blue/Green Deployments
>> >>       support is a Kubernetes-specific solution; adding a mixin directly
>> >>       available to sinks would “leak” this support outside of K8s.
>> >>       2. A sink being aware of these deployment phases violates single
>> >>       responsibility, but more importantly…
>> >>       3. Flink currently has many connectors, with the majority being
>> >>       maintained outside of the Flink code base, by separate teams, in
>> >>       separate repos, with separate release cycles. Trying to add support
>> >>       for this to every potential Flink connector project out there would
>> >>       be cumbersome, and Blue/Green Phase 2 would then only work with
>> >>       "gate-aware" sinks.
>> >>    2. I’d leave the conversation about migrating jobs between K8s
>> >>    clusters outside of this scope; even Phase 1 is meant to only work in a
>> >>    single cluster…
>> >>    3. Watermarking: excellent point, it’s indeed a requirement, so I’ll
>> >>    make sure this is validated where applicable (by the concrete
>> >>    implementation).
>> >>
>> >>
>> >> Having said what I said about point 1.1 above, I’m currently working on
>> >> an approach which uses a “GateInjectorPipelineExecutor” so to speak; in
>> >> other words a custom PipelineExecutor that would be shipped with the K8s
>> >> Operator, invoked by Flink Configuration (via “execution.target:”). This
>> >> custom piece would instantiate and inject the Gate at a fixed point in the
>> >> StreamGraph right before job submission. I still have to validate and
>> >> ensure a few things are correctly taken care of (like Type Information,
>> >> etc.) but the theory looks promising.
>> >>
>> >> For the most part this works well with Flink SQL (same configuration),
>> >> here’s my estimation:
>> >>
>> >> tEnv.executeSql("INSERT INTO my_sink ...")
>> >>     └─> SQL planner → ExecNodeGraph → Transformation[]
>> >>           └─> StreamGraph
>> >>                 └─> GateInjectorExecutor injects GateProcessFunction
>> >>                       └─> StreamGraph' (mutated) → JobGraph
>> >>                             └─> Submit Job
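To make the injection step in the flow above concrete, here is a toy sketch of the two placement options discussed in this thread (after the source, or before the sink). It does not use Flink's StreamGraph API and is not the PR's implementation: the graph is modelled as a plain adjacency map in which every node appears as a key, and `GateInjectionSketch`, `Placement` and the node name `gate` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy job graph: node name -> downstream node names (every node must be a key). */
final class GateInjectionSketch {

    enum Placement { AFTER_SOURCE, BEFORE_SINK }

    static final String GATE = "gate"; // hypothetical node name

    /** Returns a mutated copy of the graph with a gate node spliced in. */
    static Map<String, List<String>> inject(Map<String, List<String>> graph, Placement placement) {
        // Sources have no incoming edge; sinks have no outgoing edge.
        Set<String> hasInput = new LinkedHashSet<>();
        graph.values().forEach(hasInput::addAll);
        List<String> sources = new ArrayList<>();
        List<String> sinks = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : graph.entrySet()) {
            if (!hasInput.contains(e.getKey())) sources.add(e.getKey());
            if (e.getValue().isEmpty()) sinks.add(e.getKey());
        }
        // Mirrors the stated limitation: complex DAGs are rejected.
        if (sources.size() != 1 || sinks.size() != 1) {
            throw new UnsupportedOperationException(
                    "only single-source, single-sink graphs are handled");
        }

        // Work on a deep copy so the caller's graph is left untouched.
        Map<String, List<String>> out = new LinkedHashMap<>();
        graph.forEach((node, targets) -> out.put(node, new ArrayList<>(targets)));

        if (placement == Placement.AFTER_SOURCE) {
            String source = sources.get(0);
            out.put(GATE, out.get(source));                  // gate inherits the source's outputs
            out.put(source, new ArrayList<>(List.of(GATE))); // source now feeds only the gate
        } else {
            String sink = sinks.get(0);
            for (List<String> targets : out.values()) {
                targets.replaceAll(t -> t.equals(sink) ? GATE : t); // redirect edges into the sink
            }
            out.put(GATE, new ArrayList<>(List.of(sink)));   // gate feeds the sink
        }
        return out;
    }
}
```

For a linear `source -> map -> sink` graph, `AFTER_SOURCE` yields `source -> gate -> map -> sink` and `BEFORE_SINK` yields `source -> map -> gate -> sink`; a graph with two sinks is rejected.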
>> >>
>> >> I’m aiming to share some updates along these lines in the next few weeks,
>> >> but hopefully this falls in line with your objectives/thoughts overall.
>> >>
>> >> Sergio
>> >>
>> >>
>> >> On Mar 6, 2026, at 3:36 PM, Ryan van Huuksloot via dev <
>> >> [email protected] <mailto:[email protected]>> wrote:
>> >>
>> >> Hi Sergio,
>> >> Thanks for starting this conversation.
>> >>
>> >> A few thoughts regarding BlueGreen Phase 2:
>> >> 1. The Gate Operator is interesting but I don't like that we would have to
>> >> modify users' pipelines for them to use Phase 2. This gate function seems
>> >> like it could be a Mixin that connectors would implement. If you want to
>> >> use Phase 2, your sinks must implement this Mixin. I understand that a
>> >> unique GateFunction has pros, but it works less well with FlinkSQL - and
>> >> the trade-off doesn't seem worthwhile.
>> >> 2. Regarding the ConfigMap. We should consider a solution that supports
>> >> migrating Flink jobs between Kubernetes clusters. Otherwise Phase 2 is
>> >> only useful for in-cluster operations.
>> >> 3. Watermarking is a requirement. Will the Flink Kubernetes Operator
>> >> validate that the pipeline is using watermarks?
>> >>
>> >> What happens when idleness is configured? Watermarks will get ignored from
>> >> these “slow” subtasks and advance, could records from the ignored subtasks
>> >> eventually be lost?
>> >> Yes they would be lost, but that would happen irrespective of Phase 2.
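The idleness behaviour in this exchange can be illustrated with a simplified model, assuming the downstream watermark is the minimum over the inputs that are not marked idle. This is a sketch of the general idea only, not Flink's internal implementation; `IdleWatermarkSketch`, `Input`, `combined` and `isLate` are hypothetical names.

```java
import java.util.List;

/** Simplified model of combining per-input watermarks while ignoring idle inputs. */
final class IdleWatermarkSketch {

    /** One upstream input: its latest watermark and whether it is currently marked idle. */
    static final class Input {
        final long watermark;
        final boolean idle;

        Input(long watermark, boolean idle) {
            this.watermark = watermark;
            this.idle = idle;
        }
    }

    /** Combined watermark = minimum over the inputs that are not idle. */
    static long combined(List<Input> inputs) {
        return inputs.stream()
                .filter(in -> !in.idle)
                .mapToLong(in -> in.watermark)
                .min()
                .orElse(Long.MIN_VALUE); // sketch choice: all inputs idle means no watermark yet
    }

    /** A record counts as late once the combined watermark has reached its timestamp. */
    static boolean isLate(long recordTimestamp, long combinedWatermark) {
        return recordTimestamp <= combinedWatermark;
    }
}
```

With a fast input at watermark 100 and a slow input at 40, the combined watermark stays at 40 and a record with timestamp 50 is on time. Once the slow input is marked idle, the combined watermark jumps to 100 and the same record, if it arrives afterwards, is late. This happens with or without a Gate in the pipeline, which is the point made in the reply above.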
>> >>
>> >> I'll have more thoughts after we discuss the Gate Operator, as that is
>> >> crucial to the FLIP right now.
>> >>
>> >> Ryan van Huuksloot
>> >> Staff Engineer, Infrastructure | Streaming Platform
>> >>
>> >>
>> >> On Mon, Mar 2, 2026 at 6:52 PM Sergio Chong Loo <[email protected] 
>> >> <mailto:[email protected]>>
>> >> wrote:
>> >>
>> >> Bumping this (Advanced Blue/Green deployments - FLIP-504) thread after
>> >> making some code adjustments.
>> >>
>> >> FYI @drossos <https://github.com/drossos> @ryanvanhuuksloot <
>> >> https://github.com/ryanvanhuuksloot> I’d like to get your feedback since
>> >> I know you’re interested in this feature.
>> >>
>> >> Thanks,
>> >> - Sergio
>> >>
>> >>
>> >> On Dec 5, 2025, at 2:31 PM, Sergio Chong Loo <[email protected]
>> >> <mailto:[email protected]>> wrote:
>> >>
>> >> Hi folks,
>> >>
>> >> FLIP-503 (already merged) introduced the Basic Blue/Green Deployment
>> >> functionality to the Flink K8s Operator. It was very straightforward,
>> >> simply transitioning to the second deployment once it's considered stable.
>> >>
>> >> FLIP-504 is an Advanced version added on top of 503 and brings about the
>> >> notion of "record-level" coordination between the 2 deployments to have no
>> >> data duplication and exactly once semantics while preserving a smooth
>> >> transition.
>> >>
>> >> The main goals are:
>> >>    • For the community to take a quick look at the current
>> >>    functionality (previously mentioned at the Flink Forward 2025 Conference)
>> >>    • To get feedback and improvement suggestions
>> >>
>> >> FLIP-504 details:
>> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677650
>> >>
>> >> Draft PR: https://github.com/apache/flink-kubernetes-operator/pull/1043
>> >>
>> >> Thank you!
>> >> - Sergio
>> >>
>> >>
>> >

Re: FLIP-504: Blue/Green Deployments for Flink on Kubernetes - Phase 2 (WIP/Draft PR)
