Draft PR)

Ryan van Huuksloot via dev Thu, 26 Mar 2026 08:12:55 -0700

Awesome! Thanks for the update, Sergio. I'm excited to see the plan - it is
a cool idea so I'm glad it is working.


Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform
[image: Shopify]
<https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email>


On Thu, Mar 26, 2026 at 1:46 AM Sergio Chong Loo <[email protected]>
wrote:

> @Ryan / @Daniel,
>
> Good news! The “GateInjectorPipelineExecutor” idea is successful!!
>
> While the original approach of simply activating it with
> “execution.target” did not quite work, I was able to implement it via the 
> *Instrumentation
> API with a Java Agent* that injects it… the user doesn’t have to touch
> their pipelines and I added 2 options, at least for now, to place/inject
> the Gate after the source or before the sink (complex DAG cases with
> multiple sources or sinks for now are not supported).
>
> I’m documenting everything and prepping the Draft PR for your review,
> probably a couple more days.
>
> Thanks, stay tuned.
>
> - Sergio
>
>
> On Mar 16, 2026, at 3:30 PM, Sergio Chong Loo <[email protected]> wrote:
>
> Thanks for the ideas and the offer to help out Ryan! It’s invaluable to
> learn about how other users/teams scenarios.
>
> Indeed I have to pursue and evaluate the GateInjectorExecutor nonetheless
> for our internal development. Ideally it’d be great if the user can simply
> “invoke” the functionality, even give the user an option to specify “where"
> the gating mechanism to be placed (e.g. right after the source or before
> sink), or for the most flexibility they can incorporate and place the Gate
> manually just like it is now.
>
> I’ll share the progress asap and we can all take it from there (this
> should not exceed a couple weeks). I’ll definitely need more of your
> feedback to verify this with Flink SQL.
>
> Thanks again,
> Sergio
>
>
> On Mar 16, 2026, at 7:01 AM, Ryan van Huuksloot <
> [email protected]> wrote:
>
> Hi Sergio,
>
> re: 1.1
> My thought is that a BlueGreen Mixin isn't Kubernetes specific and could
> be reused by other deployment control planes. However, I do agree that
> attaching it to the sink has other implications so I am happy to pivot if
> we can find an alternative solution.
>
> re: 2
> I'm happy to leave it out of the Phase 2 implementation, but I think it
> should be possible. For example we use Phase 1 with cross cluster
> migrations today. Phase 2 within a single cluster isn't particularly useful
> for us.
>
> re: GateInjectorExecutor
> This sounds like a neat idea. I need to read more about how it would work
> but from a high level, injecting an operator before your sinks sounds like
> a good idea. Better isolation, possible with SQL, no mixins, etc.
>
> I will mention that part of the reason I want it before the sinks is
> because nine out of ten people building pipelines struggle to understand
> where their state is and how Phase 2 would affect the correctness of their
> state depending on where they put the gate. I understand that if you have a
> remote lookup and want to save bandwidth, you could optimize your pipeline
> by moving the gate before the remote call; however, that seems like an
> optimization that can be made later.
>
> Thanks for driving this! Let me know how we can help.
>
> Ryan van Huuksloot
> Staff Engineer, Infrastructure | Streaming Platform
> [image: Shopify]
> <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email>
>
>
> On Mon, Mar 16, 2026 at 2:21 AM Sergio Chong Loo <[email protected]>
> wrote:
>
>> Hi Ryan
>>
>> Thanks a lot for these details. For sure some of these observations
>> popped up during our initial discussions, and that’s why our initial goal
>> was to introduce this as simple as possible and gradually enhance it to
>> cover gaps.
>>
>> Allow me to address your concerns:
>>
>>    1. I’m happy you stressed the point of “disruption to existing
>>    pipelines”. However, there’s a few points about attempting to build this
>>    functionality into the sinks (or sources) right off the bat (read further
>>    below for my alternative):
>>       1. Kubernetes centric: as of now the Blue/Green Deployments
>>       support is a Kubernetes specific solution, adding a mixin directly
>>       available to sinks would “leak” this support outside of K8s
>>       2. A sink being aware of these deployment phases violates single
>>       responsibility, but more importantly…
>>       3. Flink currently has many connectors, with the majority being
>>       maintained outside of the Flink code base, by separate teams, separate
>>       repos, separate release cycles. This would complicate things 
>> significantly
>>       as to try and add support for this for every potential flink connector
>>       project out there would be a cumbersome. Blue/Green Phase 2 then only 
>> would
>>       works with "gate-aware" sinks.
>>    2. I’d leave the conversation about migrating jobs between K8s
>>    clusters outside of this scope, even Phase 1 is meant to only work in a
>>    single cluster…
>>    3. Watermarking, excellent point, it’s indeed a requirement so I’ll
>>    make sure this is validated where applicable (by the concrete
>>    implementation)
>>
>>
>> Having said what I said about point 1.1 above, I’m currently working on
>> an approach which uses a “GateInjectorPipelineExecutor” so to speak; in
>> other words a custom PipelineExecutor that would be shipped with the K8s
>> Operator, invoked by Flink Configuration (via “execution.target:”). This
>> custom piece would instantiate and inject the Gate at a fixed point in the
>> StreamGraph right before job submission. I still have to validate and
>> ensure a few things are correctly taken care of (like Type Information,
>> etc.) but the theory looks promising.
>>
>> For the most part this works well with Flink SQL (same configuration),
>> here’s my estimation:
>>
>> tEnv.executeSql("INSERT INTO my_sink ...")
>>     └─> SQL planner → ExecNodeGraph → Transformation[]
>>           └─> StreamGraph
>>                 └─> GateInjectorExecutor injects GateProcessFunction
>>                       └─> StreamGraph' (mutated) → JobGraph
>>                             └─> Submit Job
>>
>> I’m aiming to share some updates along these lines in the next few weeks
>> but hopefully this falls inline with your objectives/thoughts overall.
>>
>> Sergio
>>
>>
>> On Mar 6, 2026, at 3:36 PM, Ryan van Huuksloot via dev <
>> [email protected]> wrote:
>>
>> Hi Sergio,
>> Thanks for starting this conversation.
>>
>> A few thoughts regarding BlueGreen Phase 2:
>> 1. The Gate Operator is interesting but I don't like that we would have to
>> modify users' pipelines for them to use Phase 2. This gate function seems
>> like it could be a Mixin that connectors would implement. If you want to
>> use Phase 2, your sinks must implement this Mixin. I understand that a
>> unique GateFunction has pros, but it works less well with FlinkSQL - and
>> the trade-off doesn't seem worthwhile.
>> 2. Regarding the ConfigMap. We should consider a solution that supports
>> migrating Flink jobs between Kubernetes clusters. Otherwise Phase 2 is
>> only
>> useful for in cluster operations.
>> 3. Watermarking is a requirement. Will the Flink Kubernetes Operator
>> validate that the pipeline is using watermarks?
>>
>> What happens when idleness is configured? Watermarks will get ignored from
>>
>> these “slow” subtasks and advance, could records from the ignored subtasks
>> eventually be lost?
>> Yes they would be lost, but that would happen irrespective of Phase 2.
>>
>> I'll have more thoughts after we discuss the Gate Operator, as that is
>> crucial to the FLIP right now.
>>
>> Ryan van Huuksloot
>> Staff Engineer, Infrastructure | Streaming Platform
>> [image: Shopify]
>> <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email>
>>
>>
>> On Mon, Mar 2, 2026 at 6:52 PM Sergio Chong Loo <[email protected]>
>> wrote:
>>
>> Bumping this (Advanced Blue/Green deployments - FLIP-504) thread after
>> making some code adjustments.
>>
>> FYI @drossos <https://github.com/drossos> @ryanvanhuuksloot <
>> https://github.com/ryanvanhuuksloot> I’d like to get your feedback since
>> I know you’re interested in this feature.
>>
>> Thanks,
>> - Sergio
>>
>>
>> On Dec 5, 2025, at 2:31 PM, Sergio Chong Loo <[email protected]>
>>
>> wrote:
>>
>>
>> Hi folks,
>>
>> FLIP-503 (already merged) introduced the Basic Blue/Green Deployment
>>
>> functionality to the Flink K8s Operator. It was very straightforward,
>> simply transitioning to the second deployment once it's considered stable.
>>
>>
>> FLIP-504 is an Advanced version added on top of 503 and brings about the
>>
>> notion of "record-level" coordination between the 2 deployments to have no
>> data duplication and exactly once semantics while preserving a smooth
>> transition.
>>
>>
>> The main goals are:
>>    • For the community to take a quick look at the current
>>
>> functionality (previously mentioned at the Flink Forward 2025 Conference)
>>
>>    • To get feedback and improvement suggestions
>>
>> Flip 504 details:
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677650
>>
>>
>> Draft PR: https://github.com/apache/flink-kubernetes-operator/pull/1043
>>
>> Thank you!
>> - Sergio
>>
>>
>

Re: FLIP-504: Blue/Green Deployments for Flink on Kubernetes - Phase 2 (WIP/Draft PR)

Reply via email to