Hi @Daniel / @Ryan, Thanks a lot for the input. Similarly we’re swamped in a time crunch here but I’ll be taking a deep dive into your feedback hopefully before EoW.
Stay tuned! - Sergio > On Apr 15, 2026, at 3:57 PM, Daniel Rossos <[email protected]> wrote: > > Hey Sergio, > > I got around to running locally as well as doing a deeper dive into the > current implementation details (Sorry about the delay). This all looks super > awesome and huge thanks for taking the time to put this together. I have some > comments / questions below. Some I think will be answered as we test further > and some are potential next steps / features that might be nice for phase 2. > > Non-Idempotent sinks > I believe we will still have non-idempotent sinks issues by using these gate > functions. Brief recap of when this was raised before, records in the “green” > deployment could be produced before the “blue” deployment causing an > out-of-order delivery in record processing order which could have > implications for downstream sinks. One potential solution I was thinking > about that fits into this PR was to have the gate-operator be stateful and > have your “green” pipeline accumulate messages after watermark barrier and > wait for the “blue” to communicate (via the configmap) that it is done > processing before passing its records on. This would introduce a minimal > processing delay (poll rate of the “green” on configmap), but would ensure > that the ordering of the streams to down stream sinks remains consistent. > Other complications arise here with regards to handling state, but want to > get your opinion here. > > WatermarkGenerator > Inserting a watermark generator instead of requiring watermark in data. On > the SQL side of things I noticed it is required that a field be a timestamp > field that can be converted into a watermark. I was wondering if an > alternative would be to inject a watermark generator step (generate based off > some other value), that way existing table defs won’t need to be changed to > accommodate this new feature. > > Exactly-once-sinks > How does this work with exactly_once downstream sinks compatibility wise? For > example we should test using exactly_once kafka sinks to see if there will be > any conflicts there. > > Gate Parallelism > From my understanding, the parallelism of the gate operator has to be same as > the sink/source operator it is tied to. With autoscaling how do these stay in > sync? Could this cause problems? Can this gate become a performance > bottleneck somehow? > > TransitionMode Default > `transitionMode` not being set to default `BASIC` means all phase 1 > Blue-Green deployment would be broken specs on upgrade. I think we will want > to include that default > > Misc Testing > This is something for the future (and something I might play around with as I > test), is adding a case to the e2e-tests that creates a Flink pipeline (sql > and/or non-sql) and checks output from sink to ensure there are no > duplicates. I don’t have a convenient pipeline on hand > > Going forward, I am going to try to get a good test pipeline created and test > this on our sandbox environment to see if I can get a good e2e run on > prod-like conditions. > > Thanks again, > > Daniel > > On Wed, Apr 1, 2026 at 12:25 PM Sergio Chong Loo <[email protected] > <mailto:[email protected]>> wrote: >> Absolutely no rush, take your time. >> >> I’m also still working on some details around error handling. I’ll reach out >> offline so you can always work on the latest. >> >> Thank you both as well for the feedback! >> >> - Sergio >> >> >>> On Apr 1, 2026, at 8:16 AM, Daniel Rossos <[email protected] >>> <mailto:[email protected]>> wrote: >>> >>> Hi Sergio, >>> >>> Sorry about the delay on my end (just got back after a few weeks off). I >>> have caught up with + agree with all the feedback Ryan provided in this >>> thread. >>> >>> The Gate auto-injection idea is really cool. I'm going to try and make some >>> time this week to test out your PR on my side and provide feedback and >>> thoughts. >>> >>> Thanks again for driving this, >>> Daniel >>> >>> >>> On Thu, Mar 26, 2026 at 11:13 AM Ryan van Huuksloot via dev >>> <[email protected] <mailto:[email protected]>> wrote: >>>> Awesome! Thanks for the update, Sergio. I'm excited to see the plan - it is >>>> a cool idea so I'm glad it is working. >>>> >>>> Ryan van Huuksloot >>>> Staff Engineer, Infrastructure | Streaming Platform >>>> [image: Shopify] >>>> <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email> >>>> >>>> >>>> On Thu, Mar 26, 2026 at 1:46 AM Sergio Chong Loo <[email protected] >>>> <mailto:[email protected]>> >>>> wrote: >>>> >>>> > @Ryan / @Daniel, >>>> > >>>> > Good news! The “GateInjectorPipelineExecutor” idea is successful!! >>>> > >>>> > While the original approach of simply activating it with >>>> > “execution.target” did not quite work, I was able to implement it via >>>> > the *Instrumentation >>>> > API with a Java Agent* that injects it… the user doesn’t have to touch >>>> > their pipelines and I added 2 options, at least for now, to place/inject >>>> > the Gate after the source or before the sink (complex DAG cases with >>>> > multiple sources or sinks for now are not supported). >>>> > >>>> > I’m documenting everything and prepping the Draft PR for your review, >>>> > probably a couple more days. >>>> > >>>> > Thanks, stay tuned. >>>> > >>>> > - Sergio >>>> > >>>> > >>>> > On Mar 16, 2026, at 3:30 PM, Sergio Chong Loo <[email protected] >>>> > <mailto:[email protected]>> wrote: >>>> > >>>> > Thanks for the ideas and the offer to help out Ryan! It’s invaluable to >>>> > learn about how other users/teams scenarios. >>>> > >>>> > Indeed I have to pursue and evaluate the GateInjectorExecutor nonetheless >>>> > for our internal development. Ideally it’d be great if the user can >>>> > simply >>>> > “invoke” the functionality, even give the user an option to specify >>>> > “where" >>>> > the gating mechanism to be placed (e.g. right after the source or before >>>> > sink), or for the most flexibility they can incorporate and place the >>>> > Gate >>>> > manually just like it is now. >>>> > >>>> > I’ll share the progress asap and we can all take it from there (this >>>> > should not exceed a couple weeks). I’ll definitely need more of your >>>> > feedback to verify this with Flink SQL. >>>> > >>>> > Thanks again, >>>> > Sergio >>>> > >>>> > >>>> > On Mar 16, 2026, at 7:01 AM, Ryan van Huuksloot < >>>> > [email protected] <mailto:[email protected]>> >>>> > wrote: >>>> > >>>> > Hi Sergio, >>>> > >>>> > re: 1.1 >>>> > My thought is that a BlueGreen Mixin isn't Kubernetes specific and could >>>> > be reused by other deployment control planes. However, I do agree that >>>> > attaching it to the sink has other implications so I am happy to pivot if >>>> > we can find an alternative solution. >>>> > >>>> > re: 2 >>>> > I'm happy to leave it out of the Phase 2 implementation, but I think it >>>> > should be possible. For example we use Phase 1 with cross cluster >>>> > migrations today. Phase 2 within a single cluster isn't particularly >>>> > useful >>>> > for us. >>>> > >>>> > re: GateInjectorExecutor >>>> > This sounds like a neat idea. I need to read more about how it would work >>>> > but from a high level, injecting an operator before your sinks sounds >>>> > like >>>> > a good idea. Better isolation, possible with SQL, no mixins, etc. >>>> > >>>> > I will mention that part of the reason I want it before the sinks is >>>> > because nine out of ten people building pipelines struggle to understand >>>> > where their state is and how Phase 2 would affect the correctness of >>>> > their >>>> > state depending on where they put the gate. I understand that if you >>>> > have a >>>> > remote lookup and want to save bandwidth, you could optimize your >>>> > pipeline >>>> > by moving the gate before the remote call; however, that seems like an >>>> > optimization that can be made later. >>>> > >>>> > Thanks for driving this! Let me know how we can help. >>>> > >>>> > Ryan van Huuksloot >>>> > Staff Engineer, Infrastructure | Streaming Platform >>>> > [image: Shopify] >>>> > <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email> >>>> > >>>> > >>>> > On Mon, Mar 16, 2026 at 2:21 AM Sergio Chong Loo <[email protected] >>>> > <mailto:[email protected]>> >>>> > wrote: >>>> > >>>> >> Hi Ryan >>>> >> >>>> >> Thanks a lot for these details. For sure some of these observations >>>> >> popped up during our initial discussions, and that’s why our initial >>>> >> goal >>>> >> was to introduce this as simple as possible and gradually enhance it to >>>> >> cover gaps. >>>> >> >>>> >> Allow me to address your concerns: >>>> >> >>>> >> 1. I’m happy you stressed the point of “disruption to existing >>>> >> pipelines”. However, there’s a few points about attempting to build >>>> >> this >>>> >> functionality into the sinks (or sources) right off the bat (read >>>> >> further >>>> >> below for my alternative): >>>> >> 1. Kubernetes centric: as of now the Blue/Green Deployments >>>> >> support is a Kubernetes specific solution, adding a mixin directly >>>> >> available to sinks would “leak” this support outside of K8s >>>> >> 2. A sink being aware of these deployment phases violates single >>>> >> responsibility, but more importantly… >>>> >> 3. Flink currently has many connectors, with the majority being >>>> >> maintained outside of the Flink code base, by separate teams, >>>> >> separate >>>> >> repos, separate release cycles. This would complicate things >>>> >> significantly >>>> >> as to try and add support for this for every potential flink >>>> >> connector >>>> >> project out there would be a cumbersome. Blue/Green Phase 2 then >>>> >> only would >>>> >> works with "gate-aware" sinks. >>>> >> 2. I’d leave the conversation about migrating jobs between K8s >>>> >> clusters outside of this scope, even Phase 1 is meant to only work >>>> >> in a >>>> >> single cluster… >>>> >> 3. Watermarking, excellent point, it’s indeed a requirement so I’ll >>>> >> make sure this is validated where applicable (by the concrete >>>> >> implementation) >>>> >> >>>> >> >>>> >> Having said what I said about point 1.1 above, I’m currently working on >>>> >> an approach which uses a “GateInjectorPipelineExecutor” so to speak; in >>>> >> other words a custom PipelineExecutor that would be shipped with the K8s >>>> >> Operator, invoked by Flink Configuration (via “execution.target:”). This >>>> >> custom piece would instantiate and inject the Gate at a fixed point in >>>> >> the >>>> >> StreamGraph right before job submission. I still have to validate and >>>> >> ensure a few things are correctly taken care of (like Type Information, >>>> >> etc.) but the theory looks promising. >>>> >> >>>> >> For the most part this works well with Flink SQL (same configuration), >>>> >> here’s my estimation: >>>> >> >>>> >> tEnv.executeSql("INSERT INTO my_sink ...") >>>> >> └─> SQL planner → ExecNodeGraph → Transformation[] >>>> >> └─> StreamGraph >>>> >> └─> GateInjectorExecutor injects GateProcessFunction >>>> >> └─> StreamGraph' (mutated) → JobGraph >>>> >> └─> Submit Job >>>> >> >>>> >> I’m aiming to share some updates along these lines in the next few weeks >>>> >> but hopefully this falls inline with your objectives/thoughts overall. >>>> >> >>>> >> Sergio >>>> >> >>>> >> >>>> >> On Mar 6, 2026, at 3:36 PM, Ryan van Huuksloot via dev < >>>> >> [email protected] <mailto:[email protected]>> wrote: >>>> >> >>>> >> Hi Sergio, >>>> >> Thanks for starting this conversation. >>>> >> >>>> >> A few thoughts regarding BlueGreen Phase 2: >>>> >> 1. The Gate Operator is interesting but I don't like that we would have >>>> >> to >>>> >> modify users' pipelines for them to use Phase 2. This gate function >>>> >> seems >>>> >> like it could be a Mixin that connectors would implement. If you want to >>>> >> use Phase 2, your sinks must implement this Mixin. I understand that a >>>> >> unique GateFunction has pros, but it works less well with FlinkSQL - and >>>> >> the trade-off doesn't seem worthwhile. >>>> >> 2. Regarding the ConfigMap. We should consider a solution that supports >>>> >> migrating Flink jobs between Kubernetes clusters. Otherwise Phase 2 is >>>> >> only >>>> >> useful for in cluster operations. >>>> >> 3. Watermarking is a requirement. Will the Flink Kubernetes Operator >>>> >> validate that the pipeline is using watermarks? >>>> >> >>>> >> What happens when idleness is configured? Watermarks will get ignored >>>> >> from >>>> >> >>>> >> these “slow” subtasks and advance, could records from the ignored >>>> >> subtasks >>>> >> eventually be lost? >>>> >> Yes they would be lost, but that would happen irrespective of Phase 2. >>>> >> >>>> >> I'll have more thoughts after we discuss the Gate Operator, as that is >>>> >> crucial to the FLIP right now. >>>> >> >>>> >> Ryan van Huuksloot >>>> >> Staff Engineer, Infrastructure | Streaming Platform >>>> >> [image: Shopify] >>>> >> <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email> >>>> >> >>>> >> >>>> >> On Mon, Mar 2, 2026 at 6:52 PM Sergio Chong Loo <[email protected] >>>> >> <mailto:[email protected]>> >>>> >> wrote: >>>> >> >>>> >> Bumping this (Advanced Blue/Green deployments - FLIP-504) thread after >>>> >> making some code adjustments. >>>> >> >>>> >> FYI @drossos <https://github.com/drossos> @ryanvanhuuksloot < >>>> >> https://github.com/ryanvanhuuksloot> I’d like to get your feedback since >>>> >> I know you’re interested in this feature. >>>> >> >>>> >> Thanks, >>>> >> - Sergio >>>> >> >>>> >> >>>> >> On Dec 5, 2025, at 2:31 PM, Sergio Chong Loo <[email protected] >>>> >> <mailto:[email protected]>> >>>> >> >>>> >> wrote: >>>> >> >>>> >> >>>> >> Hi folks, >>>> >> >>>> >> FLIP-503 (already merged) introduced the Basic Blue/Green Deployment >>>> >> >>>> >> functionality to the Flink K8s Operator. It was very straightforward, >>>> >> simply transitioning to the second deployment once it's considered >>>> >> stable. >>>> >> >>>> >> >>>> >> FLIP-504 is an Advanced version added on top of 503 and brings about the >>>> >> >>>> >> notion of "record-level" coordination between the 2 deployments to have >>>> >> no >>>> >> data duplication and exactly once semantics while preserving a smooth >>>> >> transition. >>>> >> >>>> >> >>>> >> The main goals are: >>>> >> • For the community to take a quick look at the current >>>> >> >>>> >> functionality (previously mentioned at the Flink Forward 2025 >>>> >> Conference) >>>> >> >>>> >> • To get feedback and improvement suggestions >>>> >> >>>> >> Flip 504 details: >>>> >> >>>> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677650 >>>> >> >>>> >> >>>> >> Draft PR: https://github.com/apache/flink-kubernetes-operator/pull/1043 >>>> >> >>>> >> Thank you! >>>> >> - Sergio >>>> >> >>>> >> >>>> >
