Draft PR)

Sergio Chong Loo Thu, 02 Jul 2026 16:53:37 -0700

  Sorry for the late reply Vamshi, these are the right questions for a 
zero-loss/zero-dup rollout.


  One thing up front, because it answers most of these: the gate works at a 
single point in
  the job, right after a source or right before a sink. There it reads one 
watermark
  (whatever Flink has combined at that point) and checks each record's 
event-time once
  against a single cutover value. So it assumes a simple, linear spot where one 
watermark
  cleanly governs the records passing through. In short: the pipeline has to 
give the gate
  one event-time per record to compare against one cutover, once. Most of your 
questions are
  really "what if it isn't that simple," and in this Advanced mode (Phase 2) 
the answer is:
  place the gate on a stretch where it stays simple.

  1. Multiple sources / diverging watermarks. No special rule: the gate uses 
whatever
  watermark Flink has already combined at that point (for multiple inputs, the 
minimum across
  them), plus each record's own event-time. So it effectively waits for the 
slowest input; it
  doesn't track inputs individually. Jobs that need per-input logic should keep 
the gate on a
  single-watermark stretch.

  2. Idleness / lag. A slow-but-active source just holds the watermark back, so 
the cutover
  waits. That is safe (no loss) but can stall, which is why we are adding a 
deadline so a
  permanent stall fails loudly rather than hanging. Idleness is the sharp edge, 
and a real
  caveat on no-loss: a source marked idle is dropped from the current/minimum 
watermark
  (by design, for liveness), so the gate can cut over ahead of it. Records it 
emits afterward carry
  an event-time before the cutover, and once the outgoing deployment is torn 
down they have
  nowhere to land, so they are dropped. This is the same 
completeness-versus-liveness
  tradeoff idleness always makes, turned into a hard boundary by the cutover. 
Stated plainly:
  no-loss holds only if no source feeding the gate is idle during the 
transition. If your
  sources use idleness, treat no-loss as best-effort across the cutover. (A 
future gate
  strategy could prob. wait for per-source readiness, including idle sources, 
at the cost of
  re-introducing that stall).

  3. Mixed sources / event-time quality. Works for: event-time jobs with one 
solid watermark
  and a clear per-record event-time where the gate sits. Out of scope for Phase 
2: weak or
  absent event-time, uneven event-time quality across inputs, or anything that 
needs several
  watermarks combined.

  4. Sink guarantees. The gate makes a clean split (the old job emits up to the 
cutover, the
  new job after it), so normally each record goes out once. But the gate only 
coordinates the
  cutover; it can't take back a duplicate the sink already wrote. During the 
overlap, and
  especially if a job restarts and replays, true exactly-once still needs an 
idempotent or
  transactional sink. So I'd treat idempotent-or-transactional sinks as a 
requirement for the
  no-dup/no-loss claim. For Kafka, EOS transactions are the main backstop. The 
gate narrows
  the window; the sink closes it.

  5. Conformance matrix. Not yet: we unit-test the controller and the gate's 
record-level
  logic, but there's no end-to-end no-dup/no-loss test grid, and that's the 
real gap (and my
  next step, happy to build it with you). The grid would cross: number of 
sources, watermark
  divergence, idleness, where a failure happens, and sink type, each checked 
for duplicates
  and loss. Honest status: the single-source and idleness rows are the 
near-term ones; the
  failover + sink-type rows (the ones that matter most for a hard 
zero-loss/zero-dup bar) are
  the bigger lift and tie into the cutover-recovery work, so I wouldn't call 
failover
  correctness proven yet. Send the scenarios that matter to you and I'll add 
them.

  And none of the "out of scope" items are dead-ends. The watermark gate is 
just one strategy
  (bluegreen.gate.strategy=WATERMARK) on an extensible gate layer. The whole 
idea is that
  more complex cutover logic (multiple watermarks, per-input readiness, custom 
rules) can be
  added as a new gate implementation instead of complicating this one. So "out 
of scope for
  Phase 2" really means "a future strategy," and that's exactly where I'd 
welcome help.

  This is the pressure-testing Phase 2 needs. Much appreciated.

> Hi Sergio,
> Thanks for driving FLIP-504 forward. This is great and the top most need
> for Flink B/G cutovers. I am evaluating Phase 2 against our current
> production use cases, where zero-loss/zero-duplication during cutover is a
> hard requirement. I would appreciate a few clarifications on the intended
> implementation semantics.
> 
>   1. For multi-source jobs, if watermarks diverge, what is the intended
>   cutover rule (min-watermark, per-input readiness, or other)?
>   2. How should idleness/temporary lag on one source affect transition
>   readiness to avoid loss?
>   3. For mixed source types/event-time quality, what behavior is supported
>   vs out of scope in Phase 2?
>   4. For “no-dup/no-loss” expectations, what sink guarantees are assumed
>   (idempotent/transactional required or recommended)? For Kafka sinks
>   specifically, should EOS transactional mode be treated as the primary
>   backstop?
>   5. Is there a minimal conformance test matrix planned (divergence,
>   idleness, failover, duplicate/loss verification)? Any guidance here would
>   really help teams roll out safely with clear correctness boundaries.
> 
> Thanks,
> Vamshi


> On Jun 21, 2026, at 2:57 PM, Jing-Jia Hung <[email protected]> wrote:
> 
> Hi Sergio,
> 
> Late to the thread. The gate auto injection via the Java agent is cool.
> 
> Most of the discussion so far has centered on data-plane correctness. The
> area I'd like to understand better is the control-plane side, specifically
> recovery when a transition doesn't complete. The FLIP notes the controller
> can be left in a bad state if a deployment fails mid-transition. From
> reading the draft it looks like the teardown step waits on the gate
> signaling CLEAR_TO_TEARDOWN. If the gate stalls (watermark never advances,
> or the job fails before it signals), how do you picture the transition
> unwinding? Is there a deadline that forces a rollback, or is that part of
> the error handling still being worked out?
> 
> Also, I wonder how this approach would work with Pyflink DataStream jobs.
> My understanding is that Java agent may not be able to inject the gate in
> that case. Curious to hear your thoughts.
> 
> Thank you!
> Jing
> 
> 
> 
> On 2026/05/04 17:56:58 Sergio Chong Loo wrote: > Hi Daniel / Ryan, > >
> Again thanks a lot for the feedback. Here are some thoughts: > > The
> following are excellent points and I indeed think we can work on them
> immediately to incorporate them. I already have some ideas on
> implementation but let me flesh them out more before I share them: > -
> WatermarkGenerator > - Gate Parallelism > - TransitionMode default
> (trivial) > > Thanks for the Misc Testing proposal (especially with SQL),
> it will indeed facilitate development and make sure the outcome/output is
> precise. Let me know if/when you choose to start this effort (or any other
> item for that matter). > > Exactly-once sinks. This needs explicit testing
> indeed. I haven't verified this one yet. I believe the gate operator should
> ensure green's sink transactions don't begin until blue has completed its
> final checkpoint and committed. This is a good candidate for a dedicated
> test case before declaring phase 1 stable. > > Non-idempotent sinks. I'm
> still not sure this is a problem meant to be solved by Blue/Green
> deployments. If we think closely about it, even within a single Flink
> pipeline, strict global ordering is only guaranteed within a single
> subtask's partition. The moment you have parallelism > 1, different
> subtasks process different keys/partitions independently and emit records
> at different rates. There's no global wall-clock ordering across subtasks.
> Flink's watermark mechanism gives you event-time ordering, not
> arrival-order consistency across the whole topology. For pipelines where
> strict ordering to a non-idempotent sink is a hard requirement, the right
> pattern is a savepoint based stop/restart rather than concurrent blue/green
> execution. > > On the other hand, assuming we pursue this, your proposal
> sounds straightforward, but using state would imply the new pipeline needs
> to buffer an unbounded amount of data while waiting for the first
> deployment to finish, which makes memory management and state sizing very
> difficult. The gate watermark barrier already provides a temporal ordering
> guarantee (green doesn't advance past blue's watermark), which covers the
> majority of idempotency-sensitive sinks. Perhaps I'm looking at this from
> an overly simplistic perspective, is there a concrete example you can share
> to illustrate the scenario you have in mind? We should definitely keep this
> topic open as we make progress on the other items. > > I'll keep you posted
> on progress for the items at the top. > > ⁃ Sergio > > > > On Apr 20, 2026,
> at 8:40 AM, Sergio Chong Loo <[email protected]> wrote: > > > > Hi @Daniel /
> @Ryan, > > > > Thanks a lot for the input. Similarly we’re swamped in a
> time crunch here but I’ll be taking a deep dive into your feedback
> hopefully before EoW. > > > > Stay tuned! > > > > - Sergio > > > >> On Apr
> 15, 2026, at 3:57 PM, Daniel Rossos <[email protected]> wrote: > >> > >>
> Hey Sergio, > >> > >> I got around to running locally as well as doing a
> deeper dive into the current implementation details (Sorry about the
> delay). This all looks super awesome and huge thanks for taking the time to
> put this together. I have some comments / questions below. Some I think
> will be answered as we test further and some are potential next steps /
> features that might be nice for phase 2. > >> > >> Non-Idempotent sinks >
>>> I believe we will still have non-idempotent sinks issues by using these
> gate functions. Brief recap of when this was raised before, records in the
> “green” deployment could be produced before the “blue” deployment causing
> an out-of-order delivery in record processing order which could have
> implications for downstream sinks. One potential solution I was thinking
> about that fits into this PR was to have the gate-operator be stateful and
> have your “green” pipeline accumulate messages after watermark barrier and
> wait for the “blue” to communicate (via the configmap) that it is done
> processing before passing its records on. This would introduce a minimal
> processing delay (poll rate of the “green” on configmap), but would ensure
> that the ordering of the streams to down stream sinks remains consistent.
> Other complications arise here with regards to handling state, but want to
> get your opinion here. > >> > >> WatermarkGenerator > >> Inserting a
> watermark generator instead of requiring watermark in data. On the SQL side
> of things I noticed it is required that a field be a timestamp field that
> can be converted into a watermark. I was wondering if an alternative would
> be to inject a watermark generator step (generate based off some other
> value), that way existing table defs won’t need to be changed to
> accommodate this new feature. > >> > >> Exactly-once-sinks > >> How does
> this work with exactly_once downstream sinks compatibility wise? For
> example we should test using exactly_once kafka sinks to see if there will
> be any conflicts there. > >> > >> Gate Parallelism > >> From my
> understanding, the parallelism of the gate operator has to be same as the
> sink/source operator it is tied to. With autoscaling how do these stay in
> sync? Could this cause problems? Can this gate become a performance
> bottleneck somehow? > >> > >> TransitionMode Default > >> `transitionMode`
> not being set to default `BASIC` means all phase 1 Blue-Green deployment
> would be broken specs on upgrade. I think we will want to include that
> default > >> > >> Misc Testing > >> This is something for the future (and
> something I might play around with as I test), is adding a case to the
> e2e-tests that creates a Flink pipeline (sql and/or non-sql) and checks
> output from sink to ensure there are no duplicates. I don’t have a
> convenient pipeline on hand > >> > >> Going forward, I am going to try to
> get a good test pipeline created and test this on our sandbox environment
> to see if I can get a good e2e run on prod-like conditions. > >> > >>
> Thanks again, > >> > >> Daniel > >> > >> On Wed, Apr 1, 2026 at 12:25 PM
> Sergio Chong Loo <[email protected] <[email protected]>> wrote: > >>>
> Absolutely no rush, take your time. > >>> > >>> I’m also still working on
> some details around error handling. I’ll reach out offline so you can
> always work on the latest. > >>> > >>> Thank you both as well for the
> feedback! > >>> > >>> - Sergio > >>> > >>> > >>>> On Apr 1, 2026, at
> 8:16 AM, Daniel Rossos <[email protected] <[email protected]>>
> wrote: > >>>> > >>>> Hi Sergio, > >>>> > >>>> Sorry about the delay on my
> end (just got back after a few weeks off). I have caught up with + agree
> with all the feedback Ryan provided in this thread. > >>>> > >>>> The Gate
> auto-injection idea is really cool. I'm going to try and make some time
> this week to test out your PR on my side and provide feedback and thoughts.
>>>>>>>>>>> Thanks again for driving this, > >>>> Daniel > >>>> > >>>> >
>>>>> On Thu, Mar 26, 2026 at 11:13 AM Ryan van Huuksloot via dev <
> [email protected] <[email protected]>> wrote: > >>>>> Awesome!
> Thanks for the update, Sergio. I'm excited to see the plan - it is > >>>>>
> a cool idea so I'm glad it is working. > >>>>> > >>>>> Ryan van Huuksloot >
>>>>>> Staff Engineer, Infrastructure | Streaming Platform > >>>>> [image:
> Shopify] > >>>>> <
> https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email> >
>>>>>>>>>>>>>>>>>> On Thu, Mar 26, 2026 at 1:46 AM Sergio Chong Loo <
> [email protected] <[email protected]>> > >>>>> wrote: > >>>>> > >>>>> >
> @Ryan / @Daniel, > >>>>> > > >>>>> > Good news! The
> “GateInjectorPipelineExecutor” idea is successful!! > >>>>> > > >>>>> >
> While the original approach of simply activating it with > >>>>> >
> “execution.target” did not quite work, I was able to implement it via the
> *Instrumentation > >>>>> > API with a Java Agent* that injects it… the user
> doesn’t have to touch > >>>>> > their pipelines and I added 2 options, at
> least for now, to place/inject > >>>>> > the Gate after the source or
> before the sink (complex DAG cases with > >>>>> > multiple sources or sinks
> for now are not supported). > >>>>> > > >>>>> > I’m documenting everything
> and prepping the Draft PR for your review, > >>>>> > probably a couple more
> days. > >>>>> > > >>>>> > Thanks, stay tuned. > >>>>> > > >>>>> > - Sergio
>>>>>>>>>>>>>>>>>>>>>> On Mar 16, 2026, at 3:30 PM, Sergio Chong Loo
> <[email protected] <[email protected]>> wrote: > >>>>> > > >>>>> > Thanks
> for the ideas and the offer to help out Ryan! It’s invaluable to > >>>>> >
> learn about how other users/teams scenarios. > >>>>> > > >>>>> > Indeed I
> have to pursue and evaluate the GateInjectorExecutor nonetheless > >>>>> >
> for our internal development. Ideally it’d be great if the user can simply
>>>>>>>> “invoke” the functionality, even give the user an option to
> specify “where" > >>>>> > the gating mechanism to be placed (e.g. right
> after the source or before > >>>>> > sink), or for the most flexibility
> they can incorporate and place the Gate > >>>>> > manually just like it is
> now. > >>>>> > > >>>>> > I’ll share the progress asap and we can all take
> it from there (this > >>>>> > should not exceed a couple weeks). I’ll
> definitely need more of your > >>>>> > feedback to verify this with Flink
> SQL. > >>>>> > > >>>>> > Thanks again, > >>>>> > Sergio > >>>>> > > >>>>> >
>>>>>>>> On Mar 16, 2026, at 7:01 AM, Ryan van Huuksloot < > >>>>> >
> [email protected] <[email protected]>> wrote: > >>>>> > > >>>>>
>> Hi Sergio, > >>>>> > > >>>>> > re: 1.1 > >>>>> > My thought is that a
> BlueGreen Mixin isn't Kubernetes specific and could > >>>>> > be reused by
> other deployment control planes. However, I do agree that > >>>>> >
> attaching it to the sink has other implications so I am happy to pivot if >
>>>>>>> we can find an alternative solution. > >>>>> > > >>>>> > re: 2 >
>>>>>>> I'm happy to leave it out of the Phase 2 implementation, but I
> think it > >>>>> > should be possible. For example we use Phase 1 with
> cross cluster > >>>>> > migrations today. Phase 2 within a single cluster
> isn't particularly useful > >>>>> > for us. > >>>>> > > >>>>> > re:
> GateInjectorExecutor > >>>>> > This sounds like a neat idea. I need to read
> more about how it would work > >>>>> > but from a high level, injecting an
> operator before your sinks sounds like > >>>>> > a good idea. Better
> isolation, possible with SQL, no mixins, etc. > >>>>> > > >>>>> > I will
> mention that part of the reason I want it before the sinks is > >>>>> >
> because nine out of ten people building pipelines struggle to understand >
>>>>>>> where their state is and how Phase 2 would affect the correctness
> of their > >>>>> > state depending on where they put the gate. I understand
> that if you have a > >>>>> > remote lookup and want to save bandwidth, you
> could optimize your pipeline > >>>>> > by moving the gate before the remote
> call; however, that seems like an > >>>>> > optimization that can be made
> later. > >>>>> > > >>>>> > Thanks for driving this! Let me know how we can
> help. > >>>>> > > >>>>> > Ryan van Huuksloot > >>>>> > Staff Engineer,
> Infrastructure | Streaming Platform > >>>>> > [image: Shopify] > >>>>> > <
> https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email> >
>>>>>>>>>>>>>>>>>>>>> On Mon, Mar 16, 2026 at 2:21 AM Sergio Chong
> Loo <[email protected] <[email protected]>> > >>>>> > wrote: > >>>>> > >
>>>>>>>> Hi Ryan > >>>>> >> > >>>>> >> Thanks a lot for these details. For
> sure some of these observations > >>>>> >> popped up during our initial
> discussions, and that’s why our initial goal > >>>>> >> was to introduce
> this as simple as possible and gradually enhance it to > >>>>> >> cover
> gaps. > >>>>> >> > >>>>> >> Allow me to address your concerns: > >>>>> >> >
>>>>>>>> 1. I’m happy you stressed the point of “disruption to existing >
>>>>>>>> pipelines”. However, there’s a few points about attempting to
> build this > >>>>> >> functionality into the sinks (or sources) right off
> the bat (read further > >>>>> >> below for my alternative): > >>>>> >> 1.
> Kubernetes centric: as of now the Blue/Green Deployments > >>>>> >> support
> is a Kubernetes specific solution, adding a mixin directly > >>>>> >>
> available to sinks would “leak” this support outside of K8s > >>>>> >> 2. A
> sink being aware of these deployment phases violates single > >>>>> >>
> responsibility, but more importantly… > >>>>> >> 3. Flink currently has
> many connectors, with the majority being > >>>>> >> maintained outside of
> the Flink code base, by separate teams, separate > >>>>> >> repos, separate
> release cycles. This would complicate things significantly > >>>>> >> as to
> try and add support for this for every potential flink connector > >>>>> >>
> project out there would be a cumbersome. Blue/Green Phase 2 then only would
>>>>>>>>> works with "gate-aware" sinks. > >>>>> >> 2. I’d leave the
> conversation about migrating jobs between K8s > >>>>> >> clusters outside
> of this scope, even Phase 1 is meant to only work in a > >>>>> >> single
> cluster… > >>>>> >> 3. Watermarking, excellent point, it’s indeed a
> requirement so I’ll > >>>>> >> make sure this is validated where applicable
> (by the concrete > >>>>> >> implementation) > >>>>> >> > >>>>> >> > >>>>>
>>> Having said what I said about point 1.1 above, I’m currently working on
>>>>>>>>> an approach which uses a “GateInjectorPipelineExecutor” so to
> speak; in > >>>>> >> other words a custom PipelineExecutor that would be
> shipped with the K8s > >>>>> >> Operator, invoked by Flink Configuration
> (via “execution.target:”). This > >>>>> >> custom piece would instantiate
> and inject the Gate at a fixed point in the > >>>>> >> StreamGraph right
> before job submission. I still have to validate and > >>>>> >> ensure a few
> things are correctly taken care of (like Type Information, > >>>>> >> etc.)
> but the theory looks promising. > >>>>> >> > >>>>> >> For the most part
> this works well with Flink SQL (same configuration), > >>>>> >> here’s my
> estimation: > >>>>> >> > >>>>> >> tEnv.executeSql("INSERT INTO my_sink
> ...") > >>>>> >> └─> SQL planner → ExecNodeGraph → Transformation[] > >>>>>
>>> └─> StreamGraph > >>>>> >> └─> GateInjectorExecutor injects
> GateProcessFunction > >>>>> >> └─> StreamGraph' (mutated) → JobGraph >
>>>>>>>> └─> Submit Job > >>>>> >> > >>>>> >> I’m aiming to share some
> updates along these lines in the next few weeks > >>>>> >> but hopefully
> this falls inline with your objectives/thoughts overall. > >>>>> >> > >>>>>
>>> Sergio > >>>>> >> > >>>>> >> > >>>>> >> On Mar 6, 2026, at 3:36 PM, Ryan
> van Huuksloot via dev < > >>>>> >> [email protected] <
> [email protected]>> wrote: > >>>>> >> > >>>>> >> Hi Sergio, > >>>>> >>
> Thanks for starting this conversation. > >>>>> >> > >>>>> >> A few thoughts
> regarding BlueGreen Phase 2: > >>>>> >> 1. The Gate Operator is interesting
> but I don't like that we would have to > >>>>> >> modify users' pipelines
> for them to use Phase 2. This gate function seems > >>>>> >> like it could
> be a Mixin that connectors would implement. If you want to > >>>>> >> use
> Phase 2, your sinks must implement this Mixin. I understand that a > >>>>>
>>> unique GateFunction has pros, but it works less well with FlinkSQL - and
>>>>>>>>> the trade-off doesn't seem worthwhile. > >>>>> >> 2. Regarding
> the ConfigMap. We should consider a solution that supports > >>>>> >>
> migrating Flink jobs between Kubernetes clusters. Otherwise Phase 2 is >
>>>>>>>> only > >>>>> >> useful for in cluster operations. > >>>>> >> 3.
> Watermarking is a requirement. Will the Flink Kubernetes Operator > >>>>>
>>> validate that the pipeline is using watermarks? > >>>>> >> > >>>>> >>
> What happens when idleness is configured? Watermarks will get ignored from
>>>>>>>>>>>>>>>>> these “slow” subtasks and advance, could records from
> the ignored subtasks > >>>>> >> eventually be lost? > >>>>> >> Yes they
> would be lost, but that wo [message truncated...]

Re: FLIP-504: Blue/Green Deployments for Flink on Kubernetes - Phase 2 (WIP/Draft PR)

Reply via email to